Re: [htdig3-dev] URLs containing : not indexed?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 17 Mar 1999 14:17:16 -0600 (CST)


According to Bodo Bauer:
> Gilles Detillieux (grdetil@scrc.umanitoba.ca) wrote:
> > According to Bodo Bauer:
> > > I am trying to set up htdig for our website, to index our mailing list
> > > archives. Unfortunately, it seems to ignore exactly these links.
> > >
> > > The archives are stored in directories containing a colon (like 1999:Feb
> > > for February 1999). If I start within such a subdir, it works:
> > >
> > > start_url: http://www.suse.com/Mailinglists/suse-informix/1999:Feb/
> > >
> > > but
> > >
> > > start_url: http://www.suse.com/Mailinglists/suse-informix
> > >
> > > doesn't see this subdir. The index file there, however, contains
> > > all the links...
> > >
> > > Any idea?
> >
> > It contains all the links, but the links are not complete. They're all
> > missing their closing </a> tag. htdig doesn't process <a href=...> tags
> > until it finds the closing </a> tag, so these are just getting ignored.
>
> Thanks a lot for finding this bug. How embarrassing, I could have seen this
> myself. I looked at the HTML code about a hundred times yesterday, looking
> for some kind of error. I fixed the script generating these pages and now
> it works!
>
> Sorry for bothering you...

Not at all. It was one that was hard to spot, and htdig didn't give any
error messages to point the way. Here's a patch to htdig/HTML.cc that
should make it handle this situation better in the future...

--- htdig/HTML.cc.hrefunterm Wed Mar 17 11:01:08 1999
+++ htdig/HTML.cc Wed Mar 17 14:06:37 1999
@@ -465,6 +465,16 @@ HTML::do_tag(Retriever &retriever, Strin
                                 q++;
                             *q = '\0';
                         }
+                        if (in_ref)
+                        {
+                            if (debug > 1)
+                                cout << "Terminating previous <a href=...> tag,"
+                                     << " which didn't have a closing </a> tag."
+                                     << endl;
+                            if (dofollow)
+                                retriever.got_href(*href, description);
+                            in_ref = 0;
+                        }
                         delete href;
                         href = new URL(position, *base);
                         in_ref = 1;
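
To illustrate the idea outside of htdig, here is a minimal standalone sketch
of the same logic. It is not the htdig code: the Link struct and the
collect_links() function are hypothetical names made up for this example,
the tag matching is deliberately simplistic (only the exact forms
<a href="..."> and </a> are recognized), and the dofollow check from the
patch is left out. It only shows how a pending link can be reported when a
new <a href=...> tag arrives before the closing </a> of the previous one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for what the retriever would be told about a link.
struct Link {
    std::string url;
    std::string description;
};

// Scan a simplified HTML string and collect its links.
// Assumption: only the literal forms <a href="..."> and </a> are handled.
std::vector<Link> collect_links(const std::string &html)
{
    std::vector<Link> links;
    bool in_ref = false;            // true while inside an open <a href=...>
    std::string href, description;

    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] != '<') {
            // Plain text: it belongs to the description of an open link.
            if (in_ref)
                description += html[pos];
            ++pos;
            continue;
        }
        std::size_t end = html.find('>', pos);
        if (end == std::string::npos)
            break;                  // unterminated tag, stop scanning
        std::string tag = html.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        if (tag == "/a") {
            // Normal case: the closing tag reports the pending link.
            if (in_ref) {
                links.push_back({href, description});
                in_ref = false;
            }
        } else if (tag.compare(0, 8, "a href=\"") == 0) {
            // The step the patch adds: a new anchor starts while the
            // previous one is still open, so report the previous one first.
            if (in_ref) {
                links.push_back({href, description});
                in_ref = false;
            }
            std::size_t close = tag.find('"', 8);
            href = (close == std::string::npos)
                       ? tag.substr(8)
                       : tag.substr(8, close - 8);
            description.clear();
            in_ref = true;
        }
    }
    // As in the patch, only a following <a href=...> flushes an open link;
    // this sketch does not report an anchor left open at the end of input.
    return links;
}
```

Given an index page whose first link lacks its </a>, such as
<a href="1999:Jan/">Jan<a href="1999:Feb/">Feb</a>, this sketch reports both
links instead of silently dropping the first one.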

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


