Subject: Re: extracting data from html files
From: Michael Fowler (michael@shoebox.net)
Date: Fri Jun 28 2002 - 13:49:45 AKDT


On Fri, Jun 28, 2002 at 11:46:16AM -0800, Christopher Swingley wrote:
> * Bob Crosby <rcrosby@alaska.net> [2002-Jun-28 10:58 AKDT]:
> > That works great. It returns just what I asked for. Now I'm wondering if
> > sed can be used to do even more sophisticated editing? For example, given
> > a bunch of files, each containing multiple blocks of text like the following:
> >
> > <TD width="300" valign="top"><FONT class=Price>$48.00</FONT><BR>
> > <a href="JavaScript: funcname('../doit.asp', '', '', '',
> > '100');">Widgetname</a><BR>
> > Product_name<BR>
> > Product_ID<BR>
> >
> > could I generate a csv file consisting of lines containing
> > Price,Widgetname,Product_name,Product_ID?
[snip]
> Once you start using Perl's regular expression syntax, you'll wonder how
> you did without it.

I only caught the tail end of this conversation, at the mention of Perl.
As an alternative to Perl's regular expressions, there is a Perl module on
CPAN called HTML::TableExtract, which is well suited to pulling data out of
HTML tables like the one you describe.

Given this sample HTML (in a file named "bc.html"):

    <TABLE>
        <TR>
            <TD>Price</TD>
            <TD>Widget</TD>
            <TD>Product Name</TD>
            <TD>Product ID</TD>
        </TR>
    
        <TR>
            <TD><FONT class="Price">$48.00</FONT></TD>
            <TD><A href="javascript:runcode()">FooWidget</A></TD>
            <TD>FooProduct</TD>
            <TD>Foo</TD>
        </TR>
    
        <TR>
            <TD><FONT class="Price">$120.00</FONT></TD>
            <TD><A href="javascript:runcode()">BarWidget</A></TD>
            <TD>BarProduct</TD>
            <TD>Bar</TD>
        </TR>
    
        <TR>
            <TD><FONT class="Price">$4.00</FONT></TD>
            <TD><A href="javascript:runcode()">BazWidget</A></TD>
            <TD>BazProduct</TD>
            <TD>Baz</TD>
        </TR>
    </TABLE>

Running this example code (indented for clarity):

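    #!/usr/bin/perl -w
    
    use strict;
    use HTML::TableExtract;
    
    # Match only the table whose header row contains these column names.
    my $te = HTML::TableExtract->new(
        headers => ['Price', 'Widget', 'Product Name', 'Product ID']
    );
    $te->parse_file("bc.html");
    
    # Each matching table yields its data rows as array references;
    # print one comma-separated line per row.
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            print join(",", @$row), "\n";
        }
    }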

Produces:

    $48.00,FooWidget,FooProduct,Foo
    $120.00,BarWidget,BarProduct,Bar
    $4.00,BazWidget,BazProduct,Baz

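The HTML::TableExtract module looks for a table whose header row contains
"Price", "Widget", "Product Name", and "Product ID", then returns the text
of each cell in those columns. The surrounding markup (the FONT and A tags
here) is stripped, and the header row itself is not part of the output.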
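
One caveat if the goal is a real CSV file: a plain join(",") only works as
long as no cell contains a comma or a double quote. A price such as
"$1,200.00" would be split across two columns. Below is a minimal sketch of
a quoting helper in core Perl. The name csv_line is my own invention and is
not part of HTML::TableExtract, and the sample price is made up for
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): build one CSV line
# from a list of fields, quoting only the fields that need it.
sub csv_line {
    my @fields = @_;                        # work on a copy of the arguments
    for my $field (@fields) {
        $field = '' unless defined $field;  # treat a missing cell as empty
        if ($field =~ /[",\r\n]/) {
            $field =~ s/"/""/g;             # double any embedded quotes
            $field = qq{"$field"};          # then wrap the field in quotes
        }
    }
    return join(',', @fields);
}

# Prints: "$1,200.00",FooWidget,FooProduct,Foo
print csv_line('$1,200.00', 'FooWidget', 'FooProduct', 'Foo'), "\n";
```

In a loop over extracted rows, print csv_line(@$row), "\n" would stand in
for the join. For anything more demanding, the Text::CSV module on CPAN is
probably the better choice.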

I'm not sure how familiar the original poster is with Perl, so this may not
be entirely appropriate, but for someone who knows Perl this module can be
a very powerful and easy-to-use tool.

Michael

--
Administrator                      www.shoebox.net
Programmer, System Administrator   www.gallanttech.com
--
