At 08:16 03/02/2004, Iljitsch van Beijnum wrote:
I wrote a program in C that takes a Netscape bookmark file and stores the
content in a database. This is just under 300 lines and it's pretty
stupid; it certainly can't handle all variations of HTML.
It could be lack of programming prowess on my part, but I find parsing
HTML / XML syntax incredibly inconvenient. The most troublesome part is
that you can't just work left to right, you have to look for close tags
and so on.
Also, it just doesn't make any sense.
Why is it <input type=blah> but <title>blah</title>? Something like
input="blah" title="blah" would be much better.
I couldn't agree more. I wrote a basic XML parser which could do just what
we needed and no more and it was 259 lines long (that's not counting the
STL libraries it used). This built up a tree structure of the XML. There
was even more code to grab the particular XML element I wanted from that
tree structure. I can't see how you could do it in 12 lines, other than just
finding a specific tag value.
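To illustrate what "just finding a specific tag value" might look like in about a dozen lines, here is a rough sketch. The function name find_tag_value is made up for this example, and it assumes the tag has no attributes and is not nested inside itself. It ignores comments, CDATA and entities, so it is nowhere near a real XML parser.

```cpp
#include <string>

// Naive lookup: returns the text between the first <tag> and the next </tag>,
// or an empty string if either is missing. Hypothetical helper for
// illustration only; it does not build a tree or validate anything.
std::string find_tag_value(const std::string& xml, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    const std::string::size_type start = xml.find(open);
    if (start == std::string::npos) return "";
    const std::string::size_type begin = start + open.size();
    const std::string::size_type end = xml.find(close, begin);
    if (end == std::string::npos) return "";
    return xml.substr(begin, end - begin);
}
```

Anything beyond this (attributes, nesting, walking a tree) is where the line count starts to climb.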
OTOH, my RFC822 header email-address-aware parser is only 220 lines long.
(If I didn't need to parse email addresses it would be MUCH shorter). A
parser for a better-designed plain text metadata format could easily be in
the region of 50 lines or less.
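As a rough idea of the scale involved, here is a sketch of a parser for a hypothetical "key: value" format, one field per line, ending at the first blank line. The format and the names trim and parse_metadata are assumptions made for this example, not any existing standard. It has no continuation lines, quoting or comments, which is what keeps it short.

```cpp
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing spaces, tabs and CR/LF.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parse "key: value" lines, stopping at the first blank line.
// Lines without a colon are skipped; a repeated key overwrites the earlier one.
std::map<std::string, std::string> parse_metadata(const std::string& text) {
    std::map<std::string, std::string> fields;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (trim(line).empty()) break;  // blank line ends the metadata
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        fields[trim(line.substr(0, colon))] = trim(line.substr(colon + 1));
    }
    return fields;
}
```

This works strictly left to right, one line at a time, with no close tags to look for.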
I don't like RFC822 headers, but I think there are simpler alternatives to
XML which I'd prefer. I wouldn't die if it did turn out to be XML, but I'd
like a good reason rather than 'it's the new way of doing things'.
Paul
VPOP3 - Internet Email Server/Gateway
support(_at_)pscs(_dot_)co(_dot_)uk http://www.pscs.co.uk/