[htdig3-dev] Problem (and) solved with comments in HTML.cc


J. op den Brouw (MSQL_User@st.hhs.nl)
Tue, 16 Mar 1999 20:05:11 +0100


Hi all

as the title says, I had problems with comments in HTML and with the file
HTML.cc.
What is the problem? Well, most web pages at our school are produced by
people who don't know s**t about HTML.

They produce comments like:

<!hello -->
<!------------hello------------>

The first one is captured by HTML.cc at line 169 and following:

              else
                {
                  // Not a comment declaration after all
                  // but possibly DTD: get to the end

It isn't legal HTML but it is tackled.

But the next one causes more problems:
The code from line 155 and fol.:

              // Not the end of the declaration yet:
              // we'll try to find an actual comment
              if (strncmp((char *)position, "--", 2) == 0)
                {
                  // Found start of comment - now find the end
                  position += 2;
                  q = (unsigned char*)strstr((char *)position, "--");
                  if (!q)
                    {
                      *position = '\0';
                      break; // Rest of document seems to be a comment...
 
tries to determine whether it is a comment. Next, the code tries to find
the end. This is done by searching for the -- just before the > (comments
end with -->). But in the second comment above it fails, because the
search for -- matches the extra dashes inside the comment rather than
the closing -->. Anyway, it messes up my indexing. The trick is (I HOPE)
that line 161:

                  q = (unsigned char*)strstr((char *)position, "--");

should be changed to:

                  q = (unsigned char*)strstr((char *)position, "-->");

It finds the first occurrence of -->, so comments cannot be nested. Anyway,
it works on my htdig system.
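
To make the difference between the two search strings concrete, here is a
minimal sketch. It is a simplified model, not the actual parsing loop in
HTML.cc: the helper remainder_after_comment is a hypothetical name made up
for this illustration, and it assumes the input starts with "<!--".

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper for illustration only; it is not part of HTML.cc.
// Given a document that starts with "<!--", it searches for `terminator`
// with strstr (as line 161 does) and returns the text after the match.
std::string remainder_after_comment(const char *doc, const char *terminator)
{
    assert(std::strncmp(doc, "<!--", 4) == 0);   // precondition of this sketch
    const char *position = doc + 4;              // skip "<!--"
    const char *q = std::strstr(position, terminator);
    if (!q)
        return std::string();                    // rest of document treated as comment
    return std::string(q + std::strlen(terminator));
}
```

For the second example comment followed by some text, searching for "--"
matches right at the dashes that follow the opening, so the remainder still
contains the rest of the dashes and "hello". Searching for "-->" leaves only
the text after the comment.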

Another problem is that M$ Frontpage 98 in combination with the Frontpage
Server Extensions doesn't produce <AREA> tags. It creates a webbot (inside
a comment) instead. If the webbot has links, these links don't get
indexed. Of course this is an M$ / user problem; it's just so that you
know about it.

Hope you had some time to read it.

Greetz

--jesse