TOC HTML Generator in Perl - Build a Table of Contents for Your Web Page

I searched the Internet for a program or script that would automatically generate a TOC for my HTML code, which I need for every page of a new website: India customer care. See a sample page whose TOC has been generated using this script.

What I found on the internet

- Codeproject HTML TOC Generator: this did not seem to work. Even though I correctly inserted "INSERT contents", it gave an error and only worked partially.
- A Perl TOC generator, but with the following issues:
  - It does not use the DOM to parse or generate (meaning it just uses some tricks to read HTML)
  - It did not work when I tried it
  - It listed its own limitations
- A TOC generator sold by a company, zzee, for $20. Since it's an executable, forget customizing it.

My Solution

My Needs

Each of my web pages will contain only a single H1 tag, which will be the same as the page title for SEO reasons. The rest of the page will consist of H2-H7 tags, and I want a TOC generated for those H2-H7 tags. I don't want a TOC generator that automatically replaces any previous TOC, because I would lose any customizations I had added to it.

My Perl Script TOC Generator

Limitations

It ignores any H1 tags in the HTML page.
The h2-h7 titles must be in one of the following formats only:
- <h2>Title abc</h2>
- <h2><a ...>Title abc</a></h2>
A title must not be in the format <h2>Title<a ..>abc</a></h2>. That is, an h2 must not contain any HTML tag except a single a tag, and that a tag must enclose the full h2 text, not just part of it.
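The accepted and rejected heading shapes can be checked on a small fragment before running the full script. The following is only a sketch: heading_shape is a hypothetical helper, not part of the script, and it assumes HTML::TreeBuilder is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: classify a heading element by its direct children.
sub heading_shape {
    my ($heading) = @_;
    my @kids = $heading->content_list;
    return 'plain text'  if @kids == 1 && !ref $kids[0];
    return 'single link' if @kids == 1 && ref $kids[0] && $kids[0]->tag eq 'a';
    return 'unsupported';
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<h2>Title abc</h2>'
  . '<h2><a href="#x">Title abc</a></h2>'
  . '<h2>Title<a href="#x">abc</a></h2>'
);

# Report the shape of every h2 found in the fragment.
for my $h ($tree->look_down(_tag => 'h2')) {
    printf "%-12s %s\n", heading_shape($h), $h->as_HTML;
}
$tree->delete;
```

The first two headings fall into the supported shapes; the third mixes text and a link, which is the case the script refuses.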

How to run it

Change $filename to point to the desired file, then run the script; the result is printed to standard output.
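If editing the script for every page becomes tedious, the file name could be taken from the command line instead. This is a minimal sketch, not how the script currently works; input_file is a hypothetical helper, and the default path is the one hard-coded in the script.

```perl
use strict;
use warnings;

# Hypothetical helper: use the first command-line argument if one was given,
# otherwise fall back to the default path.
sub input_file {
    my ($args, $default) = @_;
    return @$args ? $args->[0] : $default;
}

my $filename = input_file(\@ARGV, "F:/tmp/t9.html");
print STDERR "Reading $filename\n";
```

Assuming the script were saved as toc.pl (a made-up name), it could then be run as: perl toc.pl page.html > page-with-toc.html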

How it works

Please note that it does not modify the input file but prints out the new file content! It always prints the TOC first, followed by the rest of the web page with name attributes inserted and nothing else touched. The TOC contains nested ULs that follow the H2-H7 nesting. You'll need to copy and paste the TOC HTML code into the desired location.
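The nesting logic can be tried in isolation, without any HTML parsing. The sketch below uses a made-up function name, build_toc, and takes [level, title] pairs as input. It assumes the titles are already entity-encoded and that the page starts at the h2 level.

```perl
use strict;
use warnings;

# Build nested <ul> markup from [level, title] pairs: open a list when a
# heading is deeper than the previous one, close lists when it is shallower.
sub build_toc {
    my @headings = @_;
    my ($toc, @stack) = ('');
    my $prev = 2;                       # assumption: first heading is an h2
    for my $h (@headings) {
        my ($level, $title) = @$h;
        if ($level > $prev) {
            push @stack, $level;
            $toc .= '<ul>';
        }
        elsif ($level < $prev) {
            while (@stack and $level < $stack[-1]) {
                pop @stack;
                $toc .= '</ul>';
            }
        }
        $toc .= "<li>$title";
        $prev = $level;
    }
    $toc .= '</ul>' x @stack;           # close lists still open at the end
    return "<ul>$toc</ul>";
}

print build_toc([2, 'Intro'], [3, 'Details'], [2, 'Summary']), "\n";
# prints: <ul><li>Intro<ul><li>Details</ul><li>Summary</ul>
```

The li elements are left unclosed, as in the script's own output; HTML allows the closing li tag to be omitted.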

If the HTML has never been processed by this script:
- It generates the TOC, followed by the web page with name attributes inserted within the h tags
If the HTML has already been processed by this script:
- In addition to the above, it updates or deletes the old name attributes.
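How a label from an earlier run might be recognised can be sketched as follows. The prefix and counter start are the values used in the script; next_label and is_generated are hypothetical names. The check here anchors the prefix at the start of the name, whereas the script matches it anywhere in the name.

```perl
use strict;
use warnings;

my $labelprefix = "anu555ltg-";   # prefix used by the script
my $tocIndex    = 100001;         # counter start used by the script

# Return the next generated anchor name.
sub next_label { return $labelprefix . ++$tocIndex }

# True if a name attribute looks like one generated by an earlier run.
sub is_generated {
    my ($name) = @_;
    return defined $name && index($name, $labelprefix) == 0;
}

for my $old ("anu555ltg-100007", "my-custom-anchor") {
    printf "%-18s generated earlier? %s\n", $old, is_generated($old) ? "yes" : "no";
}
print "next label: ", next_label(), "\n";   # next label: anu555ltg-100002
```

Note that the script as written assigns the new label to the link's name attribute in either case, so a hand-written name on a heading link is overwritten as well.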

Testing

I've tested the output with CSE HTML Validator, and the script generates syntactically correct code. Please note that my test code starts with a div tag rather than <html>, since my site is on a CMS.

Does not work with your HTML code?

It happened to me too; I then found that the input HTML file was not syntactically correct: one closing div was missing.
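A quick way to spot this kind of problem before running the script is to compare the number of opening and closing tags. This is a rough diagnostic sketch only, with a made-up function name, tag_balance. It uses pattern matching rather than a parser, so it ignores comments and scripts; a proper validator remains the reliable check.

```perl
use strict;
use warnings;

# Rough diagnostic: count opening and closing tags of one element by pattern
# matching and return the difference (positive means unclosed tags).
sub tag_balance {
    my ($html, $tag) = @_;
    my $open  = () = $html =~ /<\Q$tag\E\b[^>]*>/gi;
    my $close = () = $html =~ /<\/\Q$tag\E\s*>/gi;
    return $open - $close;
}

my $html = '<div><h2>Title</h2><div><p>text</p></div>';
print "unclosed div tags: ", tag_balance($html, 'div'), "\n";   # unclosed div tags: 1
```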

How the output TOC looks (I'm providing a sample!)

Code

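#!/usr/bin/perl -w
# Copyright anurag gupta; free to use under the GNU GPL license

use strict;
use feature "switch";    # enables given/when, used in hTagEncountered
use Common;              # nothing in the code shown here appears to use it
use HTML::Element;
use HTML::TreeBuilder;
use HTML::Entities;      # encode_entities() is called in traversehtml

# Input file, e.g. "F:/anurag/work/indiacustomercare/airtel/recharge.html"
my $filename = "F:/tmp/t9.html";

my $index       = 0;
my $labelprefix = "anu555ltg-";    # prefix of every generated name attribute
my $tocIndex    = 100001;          # counter appended to the prefix
my $toc;                           # TOC markup accumulated during traversal
my @stack;                         # levels of the nested <ul>s currently open
my $prevHtag    = "h2";            # heading tag seen most recently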

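# Open or close nested <ul> lists in $toc, depending on the level of the
# current heading relative to the previous one.
sub hTagEncountered($)
{
    my $hTag = shift;
    my $currLevel = (split //, $hTag)[1];    # the digit after the "h"
    given ($hTag)
    {
        when (/h1/)
        {
            break;    # h1 is ignored
        }
        default {
            my $countCurr = (split /h/, $hTag)[1];
            my $countPrev = (split /h/, $prevHtag)[1];
            if ($countCurr > $countPrev)
            {
                # deeper heading: open a nested list
                push @stack, ($currLevel);
                $toc .= "<ul>";
            }
            elsif ($countCurr < $countPrev)
            {
                # shallower heading: close lists until the stack matches
                while (@stack and $currLevel < $stack[$#stack])
                {
                    pop @stack;
                    $toc .= "</ul>";
                }
            }
        }
    }
    $prevHtag = $hTag;
}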

# Return the next unique label, e.g. "anu555ltg-100002".
sub getLabel
{
    return $labelprefix . ++$tocIndex;
}

# Walk the tree depth-first; label every h2-h7 heading and add it to $toc.
sub traversehtml
{
    my $node = $_[0];
    # $node->dump();
    if (ref($node) and $node->tag() =~ m/^h[2-7]$/i)    # it's a heading element
    {
        my @h = $node->content_list();
        if (@h == 1 and !ref($h[0]))    # heading contains a plain string and nothing else
        {
            hTagEncountered($node->tag());
            my $label  = getLabel();
            my $anchor = HTML::Element->new('a', name => $label);
            my $text   = $node->as_trimmed_text();
            $anchor->push_content($text);
            $node->delete_content();
            $text = HTML::Entities::encode_entities($text);
            $node->push_content($anchor);
            $toc .= <<EOF;
<li><a href="#$label">$text</a>
EOF
        }
        elsif (@h == 1 and $h[0]->tag() eq "a")    # <h2><a href="abc.com">ttt</a></h2> case
        {
            # See if a label from a previous run already exists
            my $prevlabel = $h[0]->attr("name");
            # delete the previous name attribute, if any
            $h[0]->attr("name", undef) if (defined($prevlabel) and $prevlabel =~ m/$labelprefix/);
            # set the new label
            my $label = getLabel();
            $h[0]->attr("name", $label);
            hTagEncountered($node->tag());
            my $text = HTML::Entities::encode_entities($node->as_trimmed_text());
            $toc .= <<EOF;
<li><a href="#$label">$text</a>
EOF
        }
        elsif (@h > 1)    # <h2>some text here<a href="abc.com">ttt</a></h2> case
        {
            die "heading must not contain HTML elements other than a single enclosing a tag";
        }
    }

    # Recurse into child elements; plain-text children are skipped
    my @h = $node->content_list();
    foreach my $item (@h)
    {
        if (ref($item)) { traversehtml($item); }
    }
}

die "File $filename not found" if !-r $filename;

my $tree = HTML::TreeBuilder->new();
$tree->parse_file($filename);

# The second child of the root is normally <body>; traverse it
my @h = $tree->content_list();
traversehtml($h[1]);

# Close any nested lists still open
while (pop @stack)
{
    $toc .= "</ul>";
}
$toc = "<ul>$toc</ul>";
print qq{<div id="icctoc"><h2>TOC</h2>$toc</div>};

# Print the body content, which now carries the inserted name attributes
my @list1 = $tree->content_list();
my @list2 = $list1[1]->content_list();
for (my $i = 0; $i < @list2; ++$i) {
    if (!ref($list2[$i]))
    {
        print $list2[$i];
    }
    else {
        print $list2[$i]->as_HTML();
    }
}

# Finally:

Saturday, July 16, 2011


Code

#!/usr/bin/perl -w

#Copyright anurag gupta ; free to use under GNU GPL License

use strict;

use feature "switch";

use Common;

use HTML::Element;

use HTML::TreeBuilder;

#"F:/anurag/work/indiacustomercare/airtel/recharge.html";

my $filename="F:/tmp/t9.html";

my $index=0;

my $labelprefix="anu555ltg-";

my $tocIndex=100001;

my $toc;

my @stack;

my $prevHtag="h2";

sub hTagEncountered($)

{

my $hTag=shift;

my $currLevel=(split //, $hTag)[1];

given($hTag)

{

when(/h1/)

{

break;

}

default{

my $countCurr= (split /h/,$hTag)[1];

my $countPrev= (split /h/,$prevHtag)[1];

if($countCurr>$countPrev)

{

push @stack,($currLevel);

$toc.="<ul>";

}

elsif($countCurr<$countPrev)

{

# Now check in the stack

while ( @stack and $currLevel < $stack[$#stack])

{

pop @stack;

$toc.="</ul>";

}

}

}

}

$prevHtag=$hTag;

}

sub getLabel

{

my $name=$labelprefix.++$tocIndex;

}

sub traversehtml

{

my $node=$_[0];

# $node->dump();

# print "-----------------\n";

# print $node->tag()."\n";

# print ref($node),"->\n";

if((ref(\$node) ne "SCALAR" )and ($node->tag() =~m/^h[2-7]$/i)) #it's an H Element!

{

my @h = $node->content_list();

if(@h==1 and ref(\$h[0]) eq "SCALAR") #H1 contains simple string and nothing else

{

hTagEncountered($node->tag());

my $label=getLabel();

my $a = HTML::Element->new('a', name => $label);

my $text=$node->as_trimmed_text();

$a->push_content($text);