Archived Forum Post

Question:

Spider does not crawl https URLs

Mar 17 '13 at 21:47

Calling spider.put_AvoidHttps(false) before crawling doesn't help either.
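
Roughly what I'm doing, as a minimal sketch (the function name, domain, and https start URL below are placeholders, not my actual site):

#include <stdio.h>
#include <CkSpider.h>

void httpsCrawlAttempt(void)
{
    CkSpider spider;

    // Placeholder domain and https start URL.
    spider.Initialize("www.example.com");
    spider.AddUnspidered("https://www.example.com/");

    // Issued before crawling, but the https URL still isn't crawled.
    spider.put_AvoidHttps(false);

    if (!spider.CrawlNext())
    {
        printf("%s\n", spider.lastErrorText());
        return;
    }
    printf("Crawled: %s\n", spider.lastUrl());
}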


Answer

I did not find a problem.

Here's my simple C++ test program:

#include <stdio.h>
#include <CkSpider.h>

void spiderTest(void)
{
    CkSpider spider;

    const char *url = "http://www.chilkatsoft.com/crawlStart.html";
    const char *domain = "www.chilkatsoft.com";

    spider.Initialize(domain);
    spider.AddUnspidered(url);
    spider.put_CacheDir("c:/aaworkarea/spiderCache");

    //  Begin crawling the site by calling CrawlNext repeatedly.
    int total = 0;
    for (int i = 0; i < 10; i++)
    {
        bool success = spider.CrawlNext();
        if (success)
        {
            total++;
            if (spider.get_LastFromCache())
            {
                printf("Downloaded from cache: %s\n", spider.lastUrl());
            }
            else
            {
                printf("Downloaded from Internet: %s\n", spider.lastUrl());
                //  Pause briefly between downloads.
                spider.SleepMs(1000);
            }
        }
        else
        {
            if (spider.get_NumUnspidered() == 0)
            {
                printf("No more URLs to spider\n");
            }
            else
            {
                printf("%s\n", spider.lastErrorText());
            }
            break;
        }
    }
}
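
If the issue only appears with https start URLs, the same loop can be retargeted at one. This is only a sketch (it assumes the same includes as above; the domain and https URL are placeholders, and put_AvoidHttps(false) is shown only because the question mentions it, not because I have confirmed it is required):

void spiderHttpsTest(void)
{
    CkSpider spider;

    //  Placeholder https site -- substitute the site that fails for you.
    spider.Initialize("www.example.com");
    spider.AddUnspidered("https://www.example.com/");
    spider.put_AvoidHttps(false);

    for (int i = 0; i < 10; i++)
    {
        if (!spider.CrawlNext())
        {
            if (spider.get_NumUnspidered() == 0)
            {
                printf("No more URLs to spider\n");
            }
            else
            {
                printf("%s\n", spider.lastErrorText());
            }
            break;
        }
        printf("Downloaded: %s\n", spider.lastUrl());
        spider.SleepMs(1000);
    }
}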

Answer

I have the same problem: I can't crawl the following webpage: https://naxom.se/

What could be the problem? The https, or the link structure?