Re: Help with sorting another file


Subject: Re: Help with sorting another file
jonr@destar.net
Date: Wed Feb 25 2004 - 23:05:40 AKST


>
> On Wed, 25 Feb 2004 jonr@destar.net wrote:
>>
>> Can one of you throw me another script that would parse the text between
>> the first < > and, if it doesn't have a stop, delete that entire entry
>> from <programme start="........> to and including </programme>?
>>
>> I have been studying Arthur's sed and awk statements and am beginning to
>> understand them, but I am nowhere near able to do this.
>
> cat $file | sed -n '/^<programme start=.*stop=.*/,/<\/programme>/p' > \
> ${file}.new
>
> --Arthur Corliss
> Bolverk's Lair -- http://arthur.corlissfamily.org/
> Digital Mages -- http://www.digitalmages.com/
> "Live Free or Die, the Only Way to Live" -- NH State Motto

cat TV.xml1 | sed -n '/^<programme start=.*stop=.*/,/<\/programme>/p' > \
TV.xml.new
(This is the way I run it)

Ok, so I am studying this line, which is meant to delete any entry between
<programme start=* and </programme> that does not have a stop= in it. When
it is run it outputs to a file named TV.xml.new, but the file is empty.
Here is what I understand so far:

cat reads the file into memory or a buffer (?) and then pipes it into sed.

The -n says run this without showing any output to the screen.
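A small experiment that may help with the -n question. This is only a sketch: the three-line sample file is made up for illustration. As far as I can tell, -n turns off sed's automatic printing of every line, and the p command prints the lines it is applied to.

```shell
tmp=$(mktemp -d)

# Three made-up sample lines to feed to sed.
printf 'one\ntwo\nthree\n' > "$tmp/sample.txt"

# Without -n, sed prints every line on its own, so p makes "two" appear twice.
with_auto=$(sed '/two/p' "$tmp/sample.txt")

# With -n, automatic printing is off; only lines that p is applied to appear.
only_p=$(sed -n '/two/p' "$tmp/sample.txt")

echo "$with_auto"
echo "---"
echo "$only_p"
```

So nothing in the original command deletes anything. Lines that are never handed to p are simply not printed.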

The first '/ is the beginning of the encapsulation of the regexp that is
to be evaluated.

The ^ means whatever is after the ^ must appear at the beginning of the line.

The . at the end of <programme start=. matches any character that comes
after it until a new modifier is found. (I think)

The * before and after *stop=.* is to match all characters on a line.
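One way to check the . and * guesses is to try them with grep on the opening line of the correct entry posted at the bottom. This is a sketch only; the variable names are arbitrary. A lone . should stand for exactly one character, while .* should stand for any run of characters, including none.

```shell
line='<programme start="20040225000000 AKST" stop="20040225003000 AKST"'

# A single . is exactly one character, so start=. followed directly by
# stop= cannot stretch across the whole quoted timestamp.
one_char=$(printf '%s\n' "$line" | grep -c 'start=.stop=' || true)

# .* is "any character, repeated zero or more times", which can cover
# the timestamp, the quotes, and the space before stop=.
any_run=$(printf '%s\n' "$line" | grep -c 'start=.*stop=' || true)

echo "start=.stop=  matches $one_char line(s)"
echo "start=.*stop= matches $any_run line(s)"
```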

The rest I can't quite figure out. I think the next / after the stop
closes the evaluation of the line.

I think the comma does nothing more than separate the two evaluated strings.

The next / begins a new evaluation of another part of the line.

I have no idea what the backslash inside the <\/programme> means.

The last / is to once again close this new evaluation.
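The comma and the slashes can be tried out on a tiny made-up file. The sketch below assumes the usual sed address syntax, where /first/,/second/ selects a range of lines from a line matching the first pattern through the next line matching the second. The <item> tags and file name are invented for the test.

```shell
tmp=$(mktemp -d)
cat > "$tmp/range.txt" <<'EOF'
header
<item>
  body
</item>
footer
EOF

# Print only the range. The \/ is a literal slash: the backslash keeps sed
# from treating it as the end of the pattern.
escaped=$(sed -n '/<item>/,/<\/item>/p' "$tmp/range.txt")

# Same range with | as the pattern delimiter (written \|...|), so the
# slash in </item> needs no backslash.
alt=$(sed -n '\|<item>|,\|</item>|p' "$tmp/range.txt")

# Same range again, but with d (and no -n) the range is dropped and
# everything else is printed.
deleted=$(sed '/<item>/,/<\/item>/d' "$tmp/range.txt")

echo "$escaped"
echo "---"
echo "$deleted"
```

If that reading is right, the comma is what turns two separate patterns into one "from here to there" range.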

I also have no idea what the p does.

The final ' ends the entire regexp that was to be evaluated.

And the > says to output to a new file.

Where in all of this is it being told to delete, and how does it know to
delete the entry it is evaluating if the stop= isn't in there but not
if it is? Is this what the p does at the end of the line? And what is
the significance of the backslash inside the <\/programme>?

In the info page for sed it says that the . matches any character. But if
you look at the bottom of this message, where I have posted what an entry is
supposed to look like when correct and what should be deleted when incorrect,
what follows the first <programme start= is not a single character but a
timestamp of when the movie starts, enclosed within " ". Is that ok? Does
the period mean more than just a single character?

This file has more in it than just these <programme start=> ... </programme>
entries. There is other data at the beginning of the file, and the
programmes start after that. When I run the command, the new file is
created but has no data in it. I need to be able to delete everything from
and including <programme start=<date> to and including </programme>, but
only when there is no stop= after that first <programme start=<date>.
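One possible reason for the empty file, offered as a guess rather than a diagnosis: if the real file indents the opening <programme tag the way the closing tag is indented in the samples at the bottom, the ^ anchor would keep the pattern from ever matching. (Another possibility is that stop= sits on a different line than start= in the real file.) The sketch below assumes a two-space indent and an entry kept on one line; the file name is made up.

```shell
tmp=$(mktemp -d)

# Assumed layout: opening tag indented by two spaces, all attributes on one line.
cat > "$tmp/indented.xml" <<'EOF'
  <programme start="20040225000000 AKST" stop="20040225003000 AKST" channel="C23toonp.zap2it.com">
    <title>Big O</title>
  </programme>
EOF

# The command as originally run: ^ demands that < be the very first character.
anchored=$(sed -n '/^<programme start=.*stop=.*/,/<\/programme>/p' "$tmp/indented.xml" | wc -l | tr -d ' ')

# Same command, but allowing optional whitespace between ^ and <programme.
relaxed=$(sed -n '/^[[:space:]]*<programme start=.*stop=.*/,/<\/programme>/p' "$tmp/indented.xml" | wc -l | tr -d ' ')

echo "anchored at column one:     $anchored line(s)"
echo "leading whitespace allowed: $relaxed line(s)"
```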

One thing I noticed is that the 'info sed' man page uses backticks `
but if you add these in, sed doesn't work at all.

CORRECT FORMAT:
<programme start="20040225000000 AKST" stop="20040225003000 AKST"
channel="C23toonp.zap2it.com">
    <title>Big O</title>
    <sub-title>Winter Night Phantom</sub-title>
    <rating system="VCHIP">
      <value>PG</value>
    </rating>
  </programme>

INCORRECT FORMAT: Needs to be deleted
<programme start="20040225233000 AKST" channel="C58tvlandp.zap2it.com">
    <title>All in the Family</title>
    <sub-title>Edith Breaks Out</sub-title>
    <desc>Edith defies an ultimatum from Archie.</desc>
    <rating system="VCHIP">
      <value>PG</value>
    </rating>
  </programme>
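For what it is worth, here is a rough awk sketch of the behaviour described above: drop each entry whose opening tag has no stop=, and keep everything else, including the data before the programmes. It is not Arthur's method, just one possible approach. It assumes start= and stop= sit on the same line as <programme, and the <tv> wrapper and the shortened entries are stand-ins for the real file.

```shell
tmp=$(mktemp -d)

# Stand-in for TV.xml1: one good entry, one entry without stop=.
cat > "$tmp/TV.sample.xml" <<'EOF'
<tv>
  <programme start="20040225000000 AKST" stop="20040225003000 AKST" channel="C23toonp.zap2it.com">
    <title>Big O</title>
  </programme>
  <programme start="20040225233000 AKST" channel="C58tvlandp.zap2it.com">
    <title>All in the Family</title>
  </programme>
</tv>
EOF

awk '
  /<programme start=/ && !/stop=/ { skip = 1 }  # opening tag with no stop=: start skipping
  !skip { print }                               # print everything outside a skipped entry
  skip && /<\/programme>/ { skip = 0 }          # the closing tag ends the skipped entry
' "$tmp/TV.sample.xml" > "$tmp/TV.sample.xml.new"

cat "$tmp/TV.sample.xml.new"
```

Unlike the sed -n ... p range, this keeps the lines that are not part of any programme, because printing is the default and skipping is the exception.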

Thanks again for all the help,

Jon