Re: extracting data from html files

Subject: Re: extracting data from html files
From: Michael Fowler (michael@shoebox.net)
Date: Fri Jun 28 2002 - 13:49:45 AKDT

Next message: James Gibson: "RE: apache sec hole.."
Previous message: James Zuelow: "RE: apache sec hole.."
In reply to: Christopher Swingley: "Re: extracting data from html files"
Next in thread: James Gibson: "Re: extracting data from html files"
Reply: Michael Fowler: "Re: extracting data from html files"
Reply: Michael Fowler: "Re: extracting data from html files"

On Fri, Jun 28, 2002 at 11:46:16AM -0800, Christopher Swingley wrote:
> * Bob Crosby <rcrosby@alaska.net> [2002-Jun-28 10:58 AKDT]:
> > That works great. It returns just what I asked for. Now I'm wondering if
> > sed can be used to do even more sophisticated editing? For example, given
> > a bunch of files, each containing multiple blocks of text like the following:
> >
> > <TD width="300" valign="top">$48.00 
> > <a href="JavaScript: funcname('../doit.asp', '', '', '',
> > '100');">Widgetname</a> 
> > Product_name 
> > Product_ID 
> >
> > could I generate a csv file consisting of lines containing
> > Price,Widgetname,Product_name,Product_ID?
[snip]
> Once you start using Perl's regular expression syntax, you'll wonder how
> you did without it.

I just caught the tail end of this conversation with the mention of Perl.
As an alternative to using Perl's regular expressions, there is a Perl module
on CPAN called HTML::TableExtract, which is ideally suited to parsing the
HTML you describe.

Given this sample HTML (in a file named "bc.html"):

<TABLE>
 <TR>
 <TD>Price</TD>
 <TD>Widget</TD>
 <TD>Product Name</TD>
 <TD>Product ID</TD>
 </TR>

 <TR>
 <TD>$48.00</TD>
 <TD><A href="javascript:runcode()">FooWidget</A></TD>
 <TD>FooProduct</TD>
 <TD>Foo</TD>
 </TR>

 <TR>
 <TD>$120.00</TD>
 <TD><A href="javascript:runcode()">BarWidget</A></TD>
 <TD>BarProduct</TD>
 <TD>Bar</TD>
 </TR>

 <TR>
 <TD>$4.00</TD>
 <TD><A href="javascript:runcode()">BazWidget</A></TD>
 <TD>BazProduct</TD>
 <TD>Baz</TD>
 </TR>
 </TABLE>

Running this example code (indented for clarity):

    #!/usr/bin/perl -w

    use strict;
    use HTML::TableExtract;

    # Select the table whose header row contains these column names.
    my $te = HTML::TableExtract->new(
        headers => ['Price', 'Widget', 'Product Name', 'Product ID']
    );
    $te->parse_file("bc.html");

    # Print each data row of every matching table as one comma-separated line.
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            print join(",", @$row), "\n";
        }
    }

Produces:

    $48.00,FooWidget,FooProduct,Foo
    $120.00,BarWidget,BarProduct,Bar
    $4.00,BazWidget,BazProduct,Baz

The HTML::TableExtract module looks for a table with the column headers
"Price", "Widget", "Product Name", and "Product ID" and extracts the text
corresponding to those columns.
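
One caveat with the plain join(",") above: a field that itself contains a
comma or a double quote (a price such as "$1,200.00", say) would produce an
ambiguous CSV line. Below is a minimal sketch of one way to quote such
fields, using only core Perl. The csv_line helper is a hypothetical name of
my own, not part of HTML::TableExtract, and the sample rows are made-up
stand-ins for what $ts->rows returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# csv_line is a hypothetical helper, not part of HTML::TableExtract.
# It quotes a field only when it contains a comma, double quote, or
# line break, and doubles any embedded double quotes.
sub csv_line {
    my @fields = @_;
    for my $field (@fields) {
        $field = '' unless defined $field;   # treat empty cells as ""
        if ($field =~ /[",\r\n]/) {
            $field =~ s/"/""/g;
            $field = qq{"$field"};
        }
    }
    return join(',', @fields);
}

# Made-up sample rows standing in for the rows of an extracted table.
my @rows = (
    ['$48.00',    'FooWidget', 'FooProduct', 'Foo'],
    ['$1,200.00', 'Big "Bar"', 'BarProduct', 'Bar'],
);
print csv_line(@$_), "\n" for @rows;
```

In the extraction loop, the print line would then become
print csv_line(@$row), "\n"; instead of the bare join. For anything beyond
a quick script, a dedicated CSV module from CPAN is probably the safer
choice.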

I'm not sure how familiar the original poster is with Perl, so this may not
be entirely appropriate, but for someone familiar with Perl this module can
be a very powerful and easy-to-use tool.

Michael

--
Administrator                      www.shoebox.net
Programmer, System Administrator   www.gallanttech.com
--



This archive was generated by hypermail 2a23 : Fri Jun 28 2002 - 13:49:15 AKDT