Question:
Hi everyone,
Looking at the example here:
http://www.example-code.com/vbdotnet/spider_mustMatchPattern.asp
Something is wrong with .AddMustMatchPattern
In VS 2012 VB.net, with textbox and button:
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
' The Chilkat Spider component/library is free.
Dim spider As New Chilkat.Spider()
' --------------------------------------------------------------------
' Note: The URLs in this example are no longer valid.
' You should replace the URLs with URLs from a site of your
' own choosing -- preferably your own site if testing.
' (Google's Directory no longer exists.)
' --------------------------------------------------------------------
' First, we'll get the outbound links for a page in the
' Google directory. Then we'll add some must-match and avoid
' patterns and re-fetch, to see them work...
spider.Initialize("www.dmoz.org")
spider.AddUnspidered("http://www.dmoz.org/Business/Accounting/")
Dim success As Boolean
success = spider.CrawlNext()
' Display the outbound links
Dim i As Integer
Dim url As String
For i = 0 To spider.NumOutboundLinks - 1
TextBox1.Text = TextBox1.Text & spider.GetOutboundLink(i) & vbCrLf
Next
' Do it again, but this time with must-match and avoid patterns.
spider.Initialize("www.dmoz.org")
spider.AddUnspidered("http://www.dmoz.org/Business/Accounting/")
' Add some must-match patterns:
spider.AddMustMatchPattern("*.com/*")
spider.AddMustMatchPattern("*.net/*")
' Add some avoid-patterns:
spider.AddAvoidOutboundLinkPattern("*.mypages.*")
spider.AddAvoidOutboundLinkPattern("*.personal.*")
spider.AddAvoidOutboundLinkPattern("*.comcast.*")
spider.AddAvoidOutboundLinkPattern("*.aol.*")
spider.AddAvoidOutboundLinkPattern("*~*")
success = spider.CrawlNext()
TextBox1.Text = TextBox1.Text & "-----------------------" & vbCrLf
' Display the outbound links
For i = 0 To spider.NumOutboundLinks - 1
TextBox1.Text = TextBox1.Text & spider.GetOutboundLink(i) & vbCrLf
Next
End Sub
This produces output containing a row of "---------". Above that line are the links from the straightforward spider.
Below the line, the links that match the .AddMustMatchPattern patterns should appear. However, nothing appears. It is as if the method is blocking everything.
Thanks!
Answer:
AddMustMatchPattern can be called one or more times to provide wildcard patterns. A URL must match at least one of these patterns; otherwise it is ignored.
In the case above, you added 2 must-match patterns:
spider.AddMustMatchPattern("*.com/*")
spider.AddMustMatchPattern("*.net/*")
None of the outbound URLs match either of these patterns. Therefore, they are all excluded. If, for example, there were an outbound link to "http://www.something.com/test.asp", then the "*.com/*" pattern would match it and it would be included in the outbound links.
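To make the "at least one pattern must match" rule concrete, here is a minimal sketch of that kind of wildcard check. It is not Chilkat's implementation: MatchesWildcard and PassesMustMatch are hypothetical helper names, the second test URL is made up for illustration, and the sketch assumes "*" stands for any run of characters and that matching is case-sensitive.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module MustMatchSketch
    ' Hypothetical helper (not part of Chilkat): True if the whole URL
    ' matches the wildcard pattern, where "*" means any run of characters.
    Function MatchesWildcard(url As String, pattern As String) As Boolean
        ' Escape regex metacharacters, then turn each escaped "*" into ".*".
        Dim regexText As String = "^" & Regex.Escape(pattern).Replace("\*", ".*") & "$"
        Return Regex.IsMatch(url, regexText)
    End Function

    ' Hypothetical helper: a URL passes if it matches at least one pattern.
    Function PassesMustMatch(url As String, patterns As String()) As Boolean
        Return patterns.Any(Function(p) MatchesWildcard(url, p))
    End Function

    Sub Main()
        Dim patterns As String() = {"*.com/*", "*.net/*"}

        ' Contains ".com/", so the first pattern matches.
        Console.WriteLine(PassesMustMatch("http://www.something.com/test.asp", patterns))

        ' A ".org" URL (made up for this sketch) matches neither pattern.
        Console.WriteLine(PassesMustMatch("http://www.dmoz.org/Business/Accounting/Firms/", patterns))
    End Sub
End Module
```

Under these assumptions, a URL such as "http://www.something.com" with no trailing slash would also fail "*.com/*", because the pattern requires the literal text ".com/" somewhere in the URL.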