J. op den Brouw (MSQL_User@st.hhs.nl)
Tue, 16 Mar 1999 20:05:11 +0100
Hi all
As the title says, I had problems with comments in HTML and with the
file HTML.cc.
What is the problem? Well, most web pages at our school are produced by
people who don't know s**t about HTML.
They produce comments like:
<!hello -->
<!------------hello------------>
The first one is captured by HTML.cc at line 169 and following:
else
{
    // Not a comment declaration after all
    // but possibly DTD: get to the end
It isn't legal HTML, but it is handled.
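To make that fallback concrete, here is a minimal sketch of what "get to
the end" could look like. It is not the actual HTML.cc code: the name
skip_declaration is hypothetical, and the sketch simply assumes that the
parser skips ahead to the next '>' when the declaration is not a comment.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, NOT taken from HTML.cc: skip a "<!...>" declaration
// that is not a comment by jumping just past the next '>'.
// If no '>' exists, return a pointer to the terminating '\0' instead.
const char *skip_declaration(const char *position)
{
    assert(position != nullptr);
    const char *close = std::strchr(position, '>');
    return close ? close + 1 : position + std::strlen(position);
}
```

With the input "<!hello -->rest" this returns a pointer to "rest", so the
malformed declaration is dropped and parsing continues after it.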
But the second one causes more problems.
The code from line 155 onward:
// Not the end of the declaration yet:
// we'll try to find an actual comment
if (strncmp((char *)position, "--", 2) == 0)
{
    // Found start of comment - now find the end
    position += 2;
    q = (unsigned char*)strstr((char *)position, "--");
    if (!q)
    {
        *position = '\0';
        break; // Rest of document seems to be a comment...
tries to determine whether it is looking at a comment. Next, the code
tries to find the end of that comment. This is done by searching for the
-- just before the > (comments end with -->). But for the second comment
above it fails: the search stops at the very next pair of dashes instead
of at the real end of the comment. Anyway, it messes up my indexing. The
trick is (I HOPE) that line 161:
q = (unsigned char*)strstr((char *)position, "--");
should be changed to:
q = (unsigned char*)strstr((char *)position, "-->");
This finds the first occurrence of -->, so don't nest comments. Anyway,
it works on my htdig system.
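To show the difference between the two search strings, here is a small
stand-alone sketch. It is not the HTML.cc code itself: the helper names
find_comment_end and comment_end_offset are made up for this example, and
"body" is assumed to be the text that remains after the opening <!-- has
been consumed.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, not part of ht://Dig: return a pointer to the first
// occurrence of `terminator` in the comment body, or nullptr if the
// comment never closes.
const char *find_comment_end(const char *body, const char *terminator)
{
    assert(body != nullptr && terminator != nullptr);
    return std::strstr(body, terminator);
}

// Offset of the terminator from the start of the body, or -1 if absent.
long comment_end_offset(const char *body, const char *terminator)
{
    const char *end = find_comment_end(body, terminator);
    return end ? static_cast<long>(end - body) : -1;
}
```

For the body of the second comment above (a run of dashes, "hello", another
run of dashes, then >), searching for "--" matches at offset 0, so the
comment appears to end immediately. Searching for "-->" matches only at
the very end, which is what the author of the page meant. The price is
that a declaration holding several SGML-style comments (<!-- a -- -- b -->)
is now treated as one single comment, which seems harmless for indexing.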
Another problem is that M$ Frontpage 98 in combination with the
Frontpage Server Extensions doesn't produce <AREA> tags. Instead it
creates a webbot (inside a comment). If the webbot has links, these
links don't get indexed. Of course this is an M$ / user problem; I just
want you to know about it.
Hope you had some time to read it.
Greetz
--jesse
This archive was generated by hypermail 2.0b3 on Tue Mar 16 1999 - 11:26:54 PST