Saturday, June 29, 2013

Crawling a Website with VIEWSTATE & EVENTVALIDATION using PHP

Making automated requests to a site that uses __VIEWSTATE and __EVENTVALIDATION (an ASP.NET site, I think) turned out to be a really tough job. To request any page, we need to send the __VIEWSTATE and __EVENTVALIDATION hidden values taken from the previous response along with the next request. These values change with every page you fetch.

In short, unlike plain stateless HTTP, these two variables let the server-side script remember the previous state the client must have visited before it can fetch the data of the current state.

You should also send all the other hidden POST variables; otherwise the request could fail.

There is a useful function, 'exfield', which extracts the value of any hidden field passed to it, and a 'sendpost' function which sends all the arguments in the passed array to the URL using the POST method and returns the result.
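The same round trip can be sketched in a few lines of Python. This is only an illustration of the idea, not the script from this post: the function names `extract_hidden_field` and `build_postback`, the sample HTML and the field values are all made up for the example, and the regular expression assumes the `name` attribute appears before `value` inside the tag.

```python
import re
from html import unescape
from urllib.parse import urlencode


def extract_hidden_field(name, html):
    """Return the value of the hidden <input> called `name`, or '' if it is absent."""
    pattern = r'<input\b[^>]*\bname="' + re.escape(name) + r'"[^>]*\bvalue="([^"]*)"'
    match = re.search(pattern, html)
    return unescape(match.group(1)) if match else ''


def build_postback(html, form_fields):
    """Copy the state fields from the previous page into the next POST body."""
    payload = dict(form_fields)
    for name in ('__VIEWSTATE', '__EVENTVALIDATION'):
        payload[name] = extract_hidden_field(name, html)
    return urlencode(payload)


# Hypothetical fragment of a previously fetched page.
sample_page = (
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4Ow==" />'
    '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAg==" />'
)

body = build_postback(sample_page, {'__EVENTTARGET': 'ggg', '__EVENTARGUMENT': 'Page$2'})
print(body)
```

The important point is that the state fields are always read from the most recent response, never reused from an older one.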

I hope it is of some help to you!

<?php
define('URL', 'http://www.example.com');  //The website using ASP.NET

  define('PAT_RESULTS_FOUND', '/Search result:([0-9]+) Results found/');

  define('TOTAL_officeofficeS_IN_PAGE', 10); //max Select01 like links in a page

  //................................................................

  $total_internet_requests = 0;

  assert_options(ASSERT_CALLBACK, 'my_assert_handler');

$pppcodearr = array('1', '2');

  foreach ($pppcodearr as $pppcode) {

  $not_found_file = $pppcode . "not-found";

  

  if(file_exists($not_found_file))

  {

  continue; //skip this ppp-code

  }

  

  $content = sendget(); //the very first page

  $fields = array(

  'ddl_dist' => '341',

  'ddl_state' => '1',

  'hdn_tabchoice' => '1',

  'search_on' => 'Search',

  'txt_dist_on' => '',

  'txt_offname' => $pppcode,

  //'__EVENTARGUMENT' => 'Page$3',

  '__EVENTTARGET' => 'ggg',

  'txt_stateon' => '',

  );
 $exflds = array('__VIEWSTATE', '__EVENTVALIDATION');

  foreach ($exflds as $val) {

  $fields[$val] = exfield($val, $content);

  }

  $fields['__VIEWSTATEENCRYPTED'] = '';
 $content = sendpost($fields);   //this is Page$1
 assert(checkvalidpage($content));

  $total_recs = get_total_results_found($content);

  

  

  if($total_recs == 0)

  {

  //indicate no records for this pppcode

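  //file_put_contents() returns 0 bytes for an empty string, so compare against false explicitly
  assert(file_put_contents($not_found_file, '') !== false);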

  continue;

  }

  

  $total_pages = ceil($total_recs / TOTAL_officeofficeS_IN_PAGE);
 $page_no = 1;
 $post_offices_in_page = $post_offices_in_page_remaining = officeofficesinpage($content);


 //if it is the first page then check if all records have already been downloaded

  $total_recs_ctr = $total_recs;

  $not_exists = false;

  for ($pg = 1; $pg <= $total_pages; ++$pg) {

  $sel = -1;

  do {

  ++$sel;

  --$total_recs_ctr;
 $file = coin_ppprecord_filename($pppcode, $pg, $sel);
 if (file_exists($file) && checkvalidpage(file_get_contents($file))) {

  //skip it

  if (dbg()) {

  print "$file already exists ... skipping\n";

  }

  } else {

  $not_exists = true;

  break 2;

  }

  } while ($total_recs_ctr && $sel < TOTAL_officeofficeS_IN_PAGE - 1);

  }
 if ($not_exists) //if at least 1 records does not exist then only enter this loop.

  do {

  //this the Page$1

  wrt($content);
 if (!checkvalidpage($content)) {

  break;

  }
 $fields = array(

  'ddl_dist' => '0',

  'ddl_state' => '1',

  'hdn_tabchoice' => '1',

  'txt_dist_on' => '',

  'txt_offname' => $pppcode,

  '__EVENTARGUMENT' => 'Select$0',

  '__EVENTTARGET' => 'ggg',

  'txt_stateon' => '',

  '__VIEWSTATEENCRYPTED' => '',

  );
 foreach ($exflds as $val) {

  $fields[$val] = exfield($val, $content);

  //print "$val= $fields[$val] \n";

  }
 for ($sel = 0; $post_offices_in_page_remaining--; ++$sel) {

  $fields['__EVENTARGUMENT'] = 'Select$' . $sel;
 $file = coin_ppprecord_filename($pppcode, $page_no, $sel);
 if (file_exists($file) && checkvalidpage(file_get_contents($file))) {

  //skip it

  if (dbg()) {

  print "$file already exists ... skipping\n";

  }

  } else {

  $result = sendpost($fields);

  if (checkvalidpage($result)) {

  assert(file_put_contents($file, $result));

  } else {

  print "Is not valid page found for $file\n";

  print " $sel < $post_offices_in_page $page_no\n";

  assert(true);

  }

  }

  }

  //go over to the next page

  ++$page_no;
 $fields = array(

  'ddl_dist' => '0',

  'ddl_state' => '1',

  'hdn_tabchoice' => '1',

  'txt_dist_on' => '',

  'txt_offname' => $pppcode,

  '__EVENTARGUMENT' => getpageno($page_no),

  '__EVENTTARGET' => 'ggg',

  'txt_stateon' => '',

  '__VIEWSTATEENCRYPTED' => '',

  );
 foreach ($exflds as $val) {

  $fields[$val] = exfield($val, $content);

  }

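  $content = sendpost($fields);

  //recount the Select links of the new page; otherwise the counter stays negative after the first page
  $post_offices_in_page = $post_offices_in_page_remaining = officeofficesinpage($content);

  } while ($page_no <= $total_pages);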

  }//for each


print "Total internet page requests = $total_internet_requests\n";
function dbg() {

  return 1;

  }
function my_assert_handler($file, $line, $code) {

  echo "<hr>Assertion Failed:

  File '$file'<br />

  Line '$line'<br />

  Code '$code'<br /><hr />";
 var_dump(debug_backtrace());

  exit(1);

  }
function get_total_results_found($content) {

  if (strstr($content, 'No Matched Post offices found')) {

  return 0;

  } else if (preg_match(PAT_RESULTS_FOUND, $content, $matches)) {

  if (dbg()) {

  print "total pppcode results=$matches[1]\n";

  }

  return $matches[1];

  } else {

  assert(false); //can't reach here

  }

  }
//count number of officeoffice link in the page

  function officeofficesinpage($content) {

  //The look like javascript:__doPostBack(&#39;ggg&#39;,&#39;Select$[0-9]{1,2}
 $pat = '/javascript:__doPostBack\(&#39;ggg&#39;,&#39;Select\$[0-9]{1,2}/';
 wrt($content);
 $ret = preg_match_all($pat, $content, $matches);
 assert($ret !== FALSE);
 return $ret;

  }
function getpageno($page) {

  return 'Page$' . $page;

  }
function getselno($sel) {

  return 'Select$' . $sel;

  }
function checkvalidpage($content) {

  if (strlen($content) < 65000 || strstr($content, 'Sorry this site has encountered a serious problem, please try reloading the page')) {

  return false;

  } else {

  return true;

  }

  }
//extract value of a hidden field

  function exfield($field, $content) {

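  //quote the field name and accept an empty value as well
  $pat = '{<input\s+type="hidden"\s+name="' . preg_quote($field) . '".*?value="([^"]*)"}';
 if (preg_match($pat, $content, $match)) {

  return $match[1];

  } else {

  print("Unable to extract $field\n");

  return ''; //send an empty value rather than null

  }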

  }
function wrt($content) {

  file_put_contents("F:/tmp/a.htm", $content);

  }
function sendget() {

  global $total_internet_requests;

  $ch = curl_init(URL);

  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  curl_setopt($ch, CURLOPT_HEADER, 0);

  $txResult = curl_exec($ch);

  $statuscode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

  

  ++$total_internet_requests;
 if (dbg() >= 2) {

  print "statuscode=$statuscode\n";

  print "Result=$txResult\n";

  }

  assert(file_put_contents("F:/tmp/abc.htm", $txResult) !== FALSE);

  curl_close($ch);

  return $txResult;

  }
function sendpost($postarr) {

  global $total_internet_requests;

  //http_build_query() URL-encodes every key and value and joins them with '&'
  $data = http_build_query($postarr);
 $custom_headers = array();

  $custom_headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";

  $custom_headers[] = "Pragma: no-cache";

  $custom_headers[] = "Cache-Control: no-cache";

  $custom_headers[] = "Accept-Language: en-us;q=0.7,en;q=0.3";

  $custom_headers[] = "Accept-Charset: utf-8,windows-1251;q=0.7,*;q=0.7";

  $ch = curl_init();

  $useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1";

  curl_setopt($ch, CURLOPT_USERAGENT, $useragent); // set user agent

  curl_setopt($ch, CURLOPT_URL, URL);
 if (strlen($data)) {

  curl_setopt($ch, CURLOPT_POSTFIELDS, $data);

  curl_setopt($ch, CURLOPT_POST, 1);

  }

  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

  curl_setopt($ch, CURLOPT_HEADER, false);

  curl_setopt($ch, CURLOPT_HTTPHEADER, $custom_headers);
 curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);

  curl_setopt($ch, CURLOPT_TIMEOUT, 40); //timeout in seconds
 $txResult = curl_exec($ch);

  

  ++$total_internet_requests;
 $statuscode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

  curl_close($ch);

  if (dbg() >=2 ) {

  print "statuscode=$statuscode\n";

  print "Result=$txResult\n";

  }

  if (dbg()) {

  assert(file_put_contents(tempnam(get_temp_dir() . "pppcode", "post_req"), $txResult) !== FALSE);

  }

  return $txResult;

  }
function get_temp_dir() {

  return "f:/tmp/";

  }
function coin_ppprecord_filename($pppcode, $page_no, $sel) {

  return get_temp_dir() . "pppcode/" . "$pppcode-" . getpageno($page_no) . '-' . getselno($sel) . ".htm";

  }

Friday, October 5, 2012

Send email with Perl using SMTP with login/password authorization

Here is the working Perl script to send email using SMTP with login/password authorization.

It works whether you run it from your PC or from a web hosting server.


#!/usr/bin/perl
use strict;
use warnings;
use Net::SMTP;

# Connect and log in once, before sending the envelope
my $smtp = Net::SMTP->new('mail.d********ia.com') or die "Cannot connect to the SMTP server";
$smtp->auth( 'conta******dia.com', 'passw***' ) or die "SMTP authentication failed";
$smtp->mail('contactus@do***********.com');
$smtp->to('ra******@gmail.com');
$smtp->data();
# Headers: one per line, then a blank line before the body
$smtp->datasend("From: contactus\@do***********.com\n");
$smtp->datasend("To: ra******\@gmail.com\n");
$smtp->datasend("Subject: Whats up?\n");
$smtp->datasend("\n");
$smtp->datasend("See, I mailed you through my program.\n");
$smtp->dataend();
$smtp->quit;
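
The part that is easiest to get wrong is the message itself: every header needs its own line, and a blank line must separate the headers from the body. As a rough illustration only, here is a Python sketch that builds such a message with the standard library and splits it back into the two parts. The addresses are placeholders and `build_message` is a name invented for this example. Nothing is actually sent.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text message; the library writes the header/body separator."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    msg.set_content(body)
    return msg


msg = build_message('sender@example.com', 'recipient@example.com',
                    'Whats up?', 'See, I mailed you through my program.\n')

# The first blank line is where the headers end and the body starts.
headers, _, payload = msg.as_string().partition('\n\n')
print(headers)
print(payload)

# To send it, smtplib.SMTP(host) followed by login(user, password) and
# send_message(msg) would be the usual route (not executed here).
```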

Monitor File Changes and Send Email Notification Script in PHP

Are you looking for a PHP program that monitors a set of files and sends an email notification when a file's size changes?

Just use this code and make it run once every hour or so (using crontab on Unix).

I wrote this to monitor my sites' .htaccess files so that I am alerted if a hacking attack modifies them.

Please read the comments at the beginning; they are self-explanatory. My private names/folders/passwords have been replaced with **** in this listing.
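
The core idea, stripped of the email and logging parts, is to compare the current file sizes with the sizes saved by the previous run. A minimal Python sketch of that comparison follows. The function names and the temporary demo files are assumptions made for this example, and like the script it reports no changes on a first run that has no earlier snapshot.

```python
import os
import tempfile


def snapshot_sizes(paths):
    """Map each existing path to its current size in bytes."""
    return {path: os.path.getsize(path) for path in paths if os.path.exists(path)}


def diff_sizes(previous, current, paths):
    """Return (changed, missing); a first run (previous is None) reports no changes."""
    missing = [path for path in paths if path not in current]
    if previous is None:
        return [], missing
    changed = [path for path in paths
               if path in current and previous.get(path) != current[path]]
    return changed, missing


# Demo on throwaway files: one file grows, one never existed.
with tempfile.TemporaryDirectory() as workdir:
    watched = os.path.join(workdir, '.htaccess')
    gone = os.path.join(workdir, 'missing', '.htaccess')
    with open(watched, 'w') as handle:
        handle.write('Options -Indexes\n')
    before = snapshot_sizes([watched, gone])
    with open(watched, 'a') as handle:
        handle.write('Redirect / http://example.invalid/\n')
    after = snapshot_sizes([watched, gone])
    changed, missing = diff_sizes(before, after, [watched, gone])
    demo_result = (len(changed), len(missing))

print(demo_result)
```

Comparing sizes is cheap but misses edits that keep the length unchanged; comparing a hash of the contents would catch those too.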



<?php
  /**
  * @file 
  * This program will monitor changes in .htaccess files in hosting account. Urgent notification will be sent in case
  * of changes immediately. Also daily once( between 12-1 am) one DAILY notification will also be sent along with changes.txt file if present
  * else without this file. If you don't receive the email next day morning then immediately check for changes.txt and why you
  * did not get the notification may be because the program did not run.
  * If any file is missing then no email is sent but it is logged in changes.txt
  * 
  * In the code below change the timezone in date_default_timezone_set to your area's
  * 
  * Every time this runs, it moves the old *.sr file to *-old.sr
  * 
  * BEFORE YOU MAKE CHANGES TO .HTACCESS(to prevent URGENT notification being generated)
  * - Check for changes.txt for any messages, delete it( if messages are of no use)
  * - delete *.sr file containing last history of file sizes (not recommended). If you don't delete it then one URGENT notification email
  *   will be sent
  * 
  * CHANGES.TXT
  * - All the information keeps on getting added to this file till you delete it
  * - Safely delete it only after carefully checking all the messages and that no URGENT messages are there.
  * 
  * HOW IT WORKS
  * - Maintain a list of .htaccess files to monitor; this can be got from:
  *     -# cd your-home-directory
  *     -# find . -name '.htaccess' -print   //copy paste the output
  * - Change your home folder in the $home variable
  * - Put your email id(s) in $all_emails array
  * - Replace email server, username and password
  * - place an entry in crontab to run it once every hour
  */
///////////////////////////////////////////////////////////////////
  //CHANGE ONLY THESE ENTRIES
  ///////////////////////////////////////////////////////////////////
$workingdirectory = "/home/****/www/tmp/all/hacked"; ///< The dir where it'll create files
$home = '/home/*****/www'; ///<The folder to prefix to the names of relative path files in $files
$all_emails = array(
  "g*****m@gmail.com",
  "d****@gmail.com"
  );  //List of all email where you want notifications to be sent
 $code = "XYLL394"; ///< Use this code in your gmail/email filter for this change notification file as it's body will contain it
  $from = "contactus@dow*******dia.com"; ///<Sender email id
  $host = "mail.dow******dia.com"; ///<Email server name
  $username = "contac******sindia.com"; ///<Login name
  $password = "*******"; ///<password
  
  /*YOUR CHANGES STOP HERE*/
  ///////////////////////////////////////////////////////////////////
  $SRFILE = 'htaccess-hacked.sr'; ///<Name of the serialized file which will hold past sizes and other info of all files in $files array
  $changesFile = "changes.txt"; ///< All changes will be logged in this
  
  ///////////////////////////////////////////////////////////////////
  
  require_once "Mail.php";
  require_once('Mail/mime.php');
error_reporting(E_ALL | E_NOTICE | E_STRICT);
if(!chdir($workingdirectory))
  {
  print "chdir to $workingdirectory failed";
  }
date_default_timezone_set('Asia/Kolkata');
$files = array(
  'C:\xampp\tmp/.htaccess',
  'forum.********.org/.htaccess',
  '.htaccess',
  'tmp/all/hacked/.htaccess',
  ); ///< List of relative paths of all files which need to be monitored.

//get the last serialized file
if (file_exists($SRFILE)) {
  $lastinfo = unserialize(file_get_contents($SRFILE));
  assert(copy($SRFILE, "$SRFILE-old"));
  } else {
  $lastinfo = null;
  }
$outarray = array(); //same name as used in the loop and in serialize() below
$text = '';   ///<It contains files whose sizes have changed
  $othertext = ''; ///< It contains files who don't exist any more
//read the current size
  $index = 0;  ///<This counts changed files info in $text
  $index2 = 0; ///<This counts unexistings files info in $othertext
foreach ($files as $file) {
  
  $fullpath = "$home/$file";
 if (file_exists($fullpath)) {
  chmod($fullpath, 0555); //make it read-only; only attempted when the file exists
  $outarray[$file] = filesize($fullpath);
 //check if same as last one
  if (!is_null($lastinfo) and (!array_key_exists($file, $lastinfo) or $lastinfo[$file] != $outarray[$file])) {
  ++$index;
  $text = "$index) $file\n$text";
  }
  }
  else //does not exist; put this information in file only; send no urgent emails
  {
  ++$index2;
  $msg = "$file not found";
  $othertext = "$index2)$msg\n$othertext";
  }
  }
//now write the array
  assert(file_put_contents($SRFILE, serialize($outarray)));
//send the email if any changes
  if (strlen(trim($text)) or strlen(trim($othertext))) {  ///< send notification only when file size changes and not for missing files
  //write in the tmpfile
 $fp = fopen("$changesFile", "a");
  assert($fp);
  fprintf($fp, "------------------------------------------------------\n");
  fprintf($fp, "\nToday's date :%s\n", date('r'));
  
  
  if(strlen(trim($othertext)))
  {
  fprintf($fp, "\n\tUN-EXISTING FILES FOLLOW:\n%s\n", $othertext);
  }
  
  if(strlen(trim($text)))
  {
  fprintf($fp, "\n\tLIST OF CHANGED FILES FOLLOW:\n%s\n", $text);
  }
  fclose($fp); //always close the log, even when only missing files were recorded
  
  if(strlen(trim($text)))
  {
  send_email($changesFile, "URGENT NOTIFICATION(Websites Hacked!): .htaccess changed"); //send urgent notification 
  }
  
  
  } else {
  
  print "no changed .htaccess files found\n";
 //if the file exists then send it once in a day at night
  //$hh = date('H'); //get the hour
  $filename = date('d'); //just to remind that we've sent the file
  if ( !file_exists($filename)) {
  assert(file_put_contents($filename, "empty"));
  
  $chfile = null;
 if(file_exists($changesFile))
  {
  $chfile = $changesFile;
  }
  
  send_email($chfile, "DAILY Notification on: .htaccess change");
 //remove yesterday's file
  $yestday = date('d', strtotime('-1 day'));
 assert($filename !== $yestday);
 if (file_exists($yestday)) { unlink($yestday); } //nothing to remove on the very first day
  }
  }
function send_email($file, $sub) {
  
  global $all_emails,  $host, $username, $password,$code,$from; 
//    if (!file_exists($file)) {
  //        print "Sending no emails as no input file: $file found\n";
  //        return;
  //    }
 $dateinfo = getdate();
 $to = $all_emails[0];
  
  for($j=1;$j< count($all_emails);++$j)  //append all the emails
  {
  $to = "$to , $all_emails[$j]";
  }
 $subject = $sub;
  $body = "Change notifications for file; code:$code";
  $html = null;
 $headers = array('From' => $from, 'To' => $to, 'Subject' => $subject);
  
  error_reporting(E_ALL);
 $crlf = "\n";
 $mime = new Mail_mime($crlf);
  //$text is not visible inside this function; use the notification body for both parts
  $mime->setTXTBody($body);
  $mime->setHTMLBody($body);
  
  if($file)
  {
  $mime->addAttachment($file, 'application/octet-stream');
  }
  //do not ever try to call these lines in reverse order
  $body = $mime->get();
  $headers = $mime->headers($headers);
 $smtp = Mail::factory('smtp', array('host' => $host,
  'auth' => true,
  'username' => $username,
  'password' => $password));
 $mail = $smtp->send($to, $headers, $body);
 if (PEAR::isError($mail)) {
  echo("
" . $mail->getMessage() . "
");
  } else {
  echo("Message Sent successfully 
  Thank you.
  Please Visit Again!
  ");
  }
  }

Sunday, September 30, 2012

Extracting urls from Google search result pages in html(Perl script)

If you want to extract the URLs of a site (the one you gave in the site: operator of a Google search) from the HTML code of the SERPs (Google search results pages), then here is a simple Perl script to do just that:

use strict;
use feature "switch";



my $file ="f:/tmp/v1.htm";
my $url_starts_with="http://www.creditcardpaymentgateways.in/2012.php";  

my $content;
{
 local $/;
 
 open FP, "<$file" or die "Can't open $file for reading";
 
 $content = <FP>;
 
 close FP;
}

my %map;
# \Q...\E keeps the '.' and '?' of the URL prefix literal inside the regex
while(
 $content =~ m#(\Q$url_starts_with\E[^"]+)#gsi) 
    
{
 $map{$1}=1;
}

map { print "$_\n"; } sort keys %map;
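
For comparison, the same extract, de-duplicate and sort steps can be sketched in Python. This is an illustration only: `extract_urls` and the sample HTML are invented for the example, and like the Perl script it simply takes everything from the prefix up to the next double quote.

```python
import re


def extract_urls(html, url_prefix):
    """Return the sorted, de-duplicated URLs in `html` that start with `url_prefix`."""
    pattern = re.compile(re.escape(url_prefix) + r'[^"]+', re.IGNORECASE)
    return sorted(set(pattern.findall(html)))


# Hypothetical fragment of a saved results page.
sample = (
    '<a href="http://www.example.com/2012.php?page=2">two</a>'
    '<a href="http://www.example.com/2012.php?page=1">one</a>'
    '<a href="http://www.example.com/2012.php?page=1">duplicate</a>'
    '<a href="http://www.example.com/other.php">other</a>'
)

urls = extract_urls(sample, 'http://www.example.com/2012.php')
for url in urls:
    print(url)
```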

Thursday, August 30, 2012

Converting Edifact to XML Using Java - Sample Implementation

Here is a sample Java implementation of converting from Edifact format to XML and vice versa. Conversion from XML to Edifact is straightforward, while the opposite is not. The great thing is that this code works irrespective of how many messages there are at any level.
You can download (for educational purposes only) this implementation here.
For comments, please email me at ddn.job --at-the-rate g-mail.com.

Here is the description of how the attachment is composed and code created:

-------------------------------------------------------------------
Directory: DB
This contains formatted file containing necessary information for converting from Edifact to XML
and vice versa
CODELT: Coded data elements
COMPT: Composite data elements
ELEMT: Simple elements
SEGMT: Segment descriptions
These have been generated by the perl formatting programs in PerlDBCreate. These files' contents are from
UN 2001B directory and from the 2001B service directory and code lists. These need to be downloaded
separately from the unece.org and gefeg.com websites for PerlDBCreate inputs.
---------------------------------------------------------------
Directory: PerlDBCreate
This contains perl formatting programs especially changed for Y2001B directories.
For different directories create new programs in new directory and let the main perl program
BuildEdifactDB.pl call the programs in it.
When BuildEdifactDB.pl is run it creates the files in DB directory which will act as input
for the programs in Parser directory.
---------------------------------------------------------------------
Directory: Parser
This contains all the source to:
1) Load Edifact information in data structures (src\BuildEdifactDirectory)
2) Format Edifact to XML (src)
A) Parse the input Edifact document
B) Convert into XML
3) Format XML to Edifact (src\XMLtoEdifactConversion)
For Format Edifact to XML (2) the starting class is Main.java.
It first defines the Edifact message structure in terms of trees/nodes and child/parents.
Please be very careful because debugging will be difficult if any link is invalid or broken.
Then it calls BuildEdifactDirectory to load the Edifact data structures.
Next it reads the input file and parses it with the parser. If unexpected input is seen then
it emits an error.
It should be noted that the output XML document looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<segment name='interchange.header'>
<composite name='syntax.identifier'>
<element name='syntax.identifier'>UNOA</element>
<element name='syntax.version.number'>3</element>
</composite>
<composite name='interchange.sender'>
<element name='interchange.sender.identification'>5400141000009</element>
<element name='identification.code.qualifier' uncl="Not Found">14</element>
</composite>
........
If the element is UNCL/UNSL, the uncl attribute will contain the code list description.
If no description is found then it'll contain "Not Found".
All odd elements are coded, but I'm not sure whether the code list description of
all such odd elements will be found. I tried to ask the UN this question but they could not
clarify it.
Groups always start with a <group> tag and end with a </group> tag.
Also I've taken care to convert all the XML reserved characters to their entity equivalents
(for example, & becomes &amp;) so that no problem will arise.
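
As a small illustration of that escaping step, here is a Python sketch using the standard library. It is not the Java code from the download; the helper name `element` and the element name `party.name` are made up for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def element(name, value):
    """Render one <element> line with reserved characters replaced by entities."""
    return "<element name=%s>%s</element>" % (quoteattr(name), escape(value))


# 'party.name' is a hypothetical element name used only for this demo.
line = element('party.name', 'Smith & Sons <Export>')
print(line)  # <element name="party.name">Smith &amp; Sons &lt;Export&gt;</element>
```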
-------------------------------------------------------------------------------
Directory: Parser/XML
Contains the XML DTD and XSD of output xml file generated by parser for validation.
-------------------------------------------------------------------------------
Directory: "Edifact e-mail conversations"
These mail exchanges contain some of my Edifact queries and answers if any. At this time of leaving I'm clear about
90% of Edifact.
-------------------------------------------------------------------------------
Directory : EDI_Sample
This I'd received from Abhijit and I was running my code on this input only. I've already created the tree
structure for this input only. However these input/doc are not correct and are partial only.
The word doc describing the *.edi input file does not correlate with the input. So correct file needs to be given.
----------------------------------------------------------------------------------
Directory: SampleOutput
Contains the output generated from the input *.edi file present in the EDI_Sample directory mentioned above.
Just like the input Edifact file, this file is unformatted. This software will not generate formatted XML because
formatting may add or remove leading/trailing blank spaces, while Edifact allows only trailing blank spaces
to be removed. This would create problems when converting a formatted XML document back to Edifact.
-----------------------------------------------------------------------------------
File: Parser\src\BuildEdifactDirectory\EdifactConstants.java
No constant has been hardcoded anywhere except in EdifactConstants.java. For example, if a non-default segment separator is
used in Edifact then it must be changed in this file. All hard-coding must be done only in this file.
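
To show why the separators are worth keeping in one place, here is a rough Python sketch of splitting a single segment into data elements and components. It is not taken from the Java parser. The function names are invented, the separators are passed as parameters with the usual Edifact defaults (`+` between data elements, `:` between components, `?` as the release character), and the sample segments are made up.

```python
def split_escaped(text, separator, release='?'):
    """Split on `separator`, skipping any character that follows the release character."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == release and i + 1 < len(text):
            current.append(ch)           # keep the pair intact for now
            current.append(text[i + 1])
            i += 2
            continue
        if ch == separator:
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append(''.join(current))
    return parts


def drop_release(text, release='?'):
    """Remove release characters, keeping the character each one protects."""
    out, i = [], 0
    while i < len(text):
        if text[i] == release and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)


def parse_segment(segment, element_sep='+', component_sep=':', release='?'):
    """Return (tag, data) where data is a list of component lists."""
    elements = split_escaped(segment, element_sep, release)
    data = [[drop_release(part, release)
             for part in split_escaped(item, component_sep, release)]
            for item in elements[1:]]
    return elements[0], data


tag, data = parse_segment("UNB+UNOA:3+5400141000009:14")
print(tag, data)
```

Because every separator is an argument, a file that declares non-default separators only needs different values passed in, which is the same motivation as the single constants file.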

---------------------------------------------------------------------------------------
Process to Run ( Only for Edifact to XML)
1) Download the required Edifact directories
2) Create a new directory (say 2002C) in PerlDBCreate and copy all the 2001B programs into it
3) Change all the programs so that all the elements are correctly extracted into all the files, as
discussed under the DB directory above
4) Create the files as in the DB directory and set their parent directory in the EdifactConstants.DATABASE_PATH
constant
5) Create the correct parse tree describing the Edifact format in Main.java. Note that we can describe
any sort of tree, so the Edifact file can contain even multiple messages at any level. However, groups
in the Edifact file must be implicit (default) ones, as in the sample input file given. Explicit group
handling needs to be added.
6) Review the EdifactConstants.java file for Edifact constants
7) Change the input Edifact file path in the EdifactFile class
8) Execute Main.java
---------------------------------------------------------------------------------------
Other Issues
1) This software is not optimized for production use. For example, there are some places where I've created a new String where
a StringBuffer should have been used for faster manipulation.
2) Also, the errors that MessageParser emits on non-matching input need to be turned into useful, meaningful messages.
3) This software does not do any sort of Edifact format verification.
---------------------------------------------------------------------------------------

SUGGESTIONS
1) Use XMLSpy/XMLPad etc. for XML editing and manipulation
2) For generating PDF documents use only the standard XSL-FO and don't use the non-standard iText library.
3) For an XSL-FO formatter, download the free Apache executable

Tuesday, May 15, 2012

How to Get Forum Change Alerts by Email

I have a couple of forums (MyBB and phpBB), and with new forums we are always quite eager to respond to and moderate new posts. I searched for every option that would notify me automatically about new posts. I even tried http://www.Watchthatpage.com but it did not work out.
I was wasting a lot of time visiting my forums to check for new posts. Ultimately I wrote a PHP script which crontab executes every hour on my web hosting. It is a crude, custom-made script and I'm sure it will not work with your forum as is, but you can make something similar. It's a great relief for me now! I get an email alert whenever there is a new post!

Here is how it works: after reading the HTML code of the page, I trim some content based on regular expressions. Then I convert it to text using the html2text class and again trim some content based on regular expressions.
At the very least, we need to remove any current times, since they change on every fetch.
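
The trimming step is what keeps the comparison from firing on every run. A minimal Python sketch of the idea, assuming the page has already been converted to text: the patterns are borrowed from the MyBB list in the script below, while the function names and sample texts are made up for the example.

```python
import re

# Parts of the page that change on every fetch and must be ignored.
VOLATILE_PATTERNS = [
    re.compile(r'Current time:.*?\b[AP]M\b', re.IGNORECASE),
    re.compile(r'[0-9][0-9][:-][0-9][0-9]'),
    re.compile(r'Today|Yesterday'),
]


def normalize(text):
    """Strip every volatile fragment so only real content remains."""
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub('', text)
    return text


def has_changed(old_text, new_text):
    """True only when something other than the volatile fragments differs."""
    return normalize(old_text) != normalize(new_text)


old = "Current time: 05-15-2012, 10:30 AM\nLatest post: Hello"
same = "Current time: 05-15-2012, 11:30 AM\nLatest post: Hello"
new = "Current time: 05-15-2012, 11:30 AM\nLatest post: New reply"

print(has_changed(old, same))  # False
print(has_changed(old, new))   # True
```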

Download class.html2text.inc from here.
License : As per GNU GPL.

<?php
if(!checkruntime()) //don't run in the night times
{
exit(0); //not a time to run
}
require_once('class.html2text.inc');
define ('MYBB','mybb');
define ('PG', 'pg');
define ('EC', 'emile-coue');
$sites = array(
PG => array("creditcardpaymentgateways.in"),
MYBB => array("indianworkingwoman.org","indiaconsumercomplaints.org"),
EC => array("emile-coue.org"),
);
//Applied in html. This part is RETAINED
$reghtmlRetained = array(
EC => array('{<table.*?</table>}si'),
);
//This is applied in the text. This part is removed
$reg = array(

MYBB => array('/Current time:.*?\b[AP]M\b/i',
'/^.*Latest posts\s+Topic/s',
'/\s+Most views\s+.*/s',
'/[0-9][0-9][:-][0-9][0-9]/',
'/Today|Yesterday/',
),
PG=> array('/It is currently.*?\b[ap]m\b/i',
//'/[?]sid=[0-9a-z]+/i',
'/sid=[0-9a-z]+(#p[0-9]+)?/i',
'/[*] Delete all board cookies .*/s'

),
EC=> array('/It is currently.*?\b[ap]m\b/i',
//'/[?]sid=[0-9a-z]+/i',
'/sid=[0-9a-z]+(#p[0-9]+)?/i',
'/\s+Who Is Online\s+.*/s',
'/Discuss about any auto suggestion methods, their comparison.*/s',

)
);

chdir('/home/premg/www/nov/all/forumcomp');
foreach($sites as $key =>$sitearr)
{
foreach($sitearr as $site)
{
$site1 = "http://forum.$site";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$site1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$data = curl_exec($ch);
curl_close($ch);
$content = $data;

assert(strlen($content)>500);

if(isset($reghtmlRetained[$key]))
{
foreach($reghtmlRetained[$key] as $pat)
{
//print "$content\n";
$matches = array();
$ret = preg_match_all($pat,$content,$matches,PREG_SET_ORDER);
assert($ret > 0);

$content = "";

for($i=0;$i<$ret;++$i)
{
$content .= $matches[$i][0];
}
}
}

$h2t = new html2text($content);
$content = $h2t->get_text();

if(isset($reg[$key]))
{
foreach($reg[$key] as $pat)
{
$content = preg_replace($pat,'',$content);
// print "new pat strlen=".strlen($content)."\n";
}
}

//get the old content
//on the very first run there is no saved copy yet
$oldtext = file_exists($site) ? file_get_contents($site) : '';

if($oldtext != $content) //send email
{
send_email('preggup189456@gmail.com', $site);
//now write this new content

//rename the old one to .old
$tofile = "$site.old.txt";
if(file_exists($tofile))
{
unlink($tofile);
}
if(file_exists($site))
{
rename("$site", $tofile );
}

assert(file_put_contents($site, $content));

}

}

}
function send_email($email,$site)
{
print "sending email for $site ...\n";
$subject = "Content $site changed";
$body = "Click <a href=\"http://forum.$site\">here</a> to see.";
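//the body contains an HTML link, so declare the content type
$headers = "MIME-Version: 1.0\r\nContent-Type: text/html; charset=UTF-8";
mail($email,$subject,$body,$headers);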
return;
}

function checkruntime()
{
$dt = (gmdate("Hi"));
$hr = (int)($dt/100);
$min = gmdate("Hi") - $hr*100;

$hr += $min/60.0;

//convert to indian time
$hr += 5.5;

if($hr >= 24) //wrap only past midnight; 23:xx must stay 23 so the check below can catch it
{
$hr = $hr - 24;
}

if( ($hr >= 0 and $hr < 7) or (int)$hr == 23 )
{
return false;
}
else
{
return true;
}
}
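
The hour arithmetic in checkruntime() can also be left to a date library. Here is a small Python sketch of the same quiet-hours rule (skip runs from 23:00 to 07:00 Indian time), written under the assumption of a fixed +5:30 offset. `should_run` is a name invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Indian Standard Time as a fixed offset from UTC.
IST = timezone(timedelta(hours=5, minutes=30))


def should_run(now_utc):
    """Return True between 07:00 and 22:59 Indian time, False at night."""
    local = now_utc.astimezone(IST)
    return 7 <= local.hour < 23


print(should_run(datetime(2012, 5, 15, 6, 0, tzinfo=timezone.utc)))   # 11:30 IST
print(should_run(datetime(2012, 5, 15, 18, 0, tzinfo=timezone.utc)))  # 23:30 IST
```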

Tuesday, August 2, 2011

Copy Directory Structure using Perl utility

Do you have a directory and want to create a replica of its directory structure but without its file contents? Then this is the utility that lets you do it.

#!/usr/bin/perl -w
use strict;

use File::Find;
use File::Path;

my %files;

#CORRECT USAGE
#my $ROOT_DIR="F:/NOT TO BE BACKED UP";
#my $RELATIVE_PATH_UNDER_ROOT_DIR="INSTALL"; #That is I want the directory structure of $ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR created new at $NEW_ROOT/$RELATIVE_PATH_UNDER_ROOT_DIR
#my $NEW_ROOT="F:/Anurag";

my $ROOT_DIR="F:/STATIC/Anurag";
my $RELATIVE_PATH_UNDER_ROOT_DIR="personal"; #That is I want the directory structure of $ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR created new at $NEW_ROOT/$RELATIVE_PATH_UNDER_ROOT_DIR
my $NEW_ROOT="F:/Anurag";

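sub mySub
{
# Record every directory itself, so that empty directories are replicated too
return unless -d $_;

my $dir=$File::Find::name;

# \Q...\E keeps any regex metacharacters in the path literal
$dir=~s{^\Q$ROOT_DIR\E}{$NEW_ROOT};

$files{$dir}++};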

sub loadFiles {

find( \&mySub, "$ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR"); #custom subroutine find, parse $dir

}

die "$NEW_ROOT is not directory" if( ! -d $NEW_ROOT);

die "$ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR is not directory" if ( ! -d "$ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR" );

loadFiles();

my $dirs = keys %files;
my $created=0;

foreach my $key (keys %files)
{
if( ! -d $key)
{

if( scalar( mkpath($key)) <= 0) #Add error checking here
{
print STDERR "Failed to create directory: $key!\n";
}
else
{
++$created;
}
}

}

#map { print "$_\n"; } sort keys %files;

print "Total dirs created=$created out of $dirs","\n";
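
For readers who prefer Python, the same idea can be sketched with the standard library. This is an illustration under assumptions, not a port of the script above: `copy_tree_structure` is an invented name, the demo runs on throwaway temporary directories, and the returned count is the number of directories that did not already exist.

```python
import os
import tempfile


def copy_tree_structure(source_root, target_root):
    """Recreate the directories of source_root under target_root, without any files."""
    created = 0
    for current, _dirs, _files in os.walk(source_root):
        relative = os.path.relpath(current, source_root)
        destination = os.path.normpath(os.path.join(target_root, relative))
        if not os.path.isdir(destination):
            os.makedirs(destination)
            created += 1
    return created


# Demo: src/a/b and src/c are replicated, src/a/notes.txt is not.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "src")
    target = os.path.join(workdir, "dst")
    os.makedirs(os.path.join(source, "a", "b"))
    os.makedirs(os.path.join(source, "c"))
    with open(os.path.join(source, "a", "notes.txt"), "w") as handle:
        handle.write("not copied")
    demo_created = copy_tree_structure(source, target)
    demo_has_leaf = os.path.isdir(os.path.join(target, "a", "b"))
    demo_has_file = os.path.exists(os.path.join(target, "a", "notes.txt"))

print(demo_created, demo_has_leaf, demo_has_file)
```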