I want to crawl every page of a site. My plan is to find all the links on a page that stay within the domain, visit them, and repeat. I also need to take steps to avoid repeating work.
Getting started is easy enough:
require 'nokogiri'
require 'open-uri'

page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath('//a') # find all links on the current page
main_links = links.map{ |l| l['href'] if l['href'] =~ /^\// }.compact.uniq
`main_links` is now an array of the links on the current page that begin with "/" (which should only be links on the current domain).
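For example, that `/^\//` filter keeps only root-relative hrefs and drops everything else (note it would also drop a same-domain absolute URL like `http://example.com/x`, and any purely relative href like `page.html`). A quick check with made-up href strings:

```ruby
hrefs = ['/about', '/about', 'http://other.com/x', 'mailto:me@example.com', '#top']
main_links = hrefs.map { |l| l if l =~ /^\// }.compact.uniq
# => ["/about"]
```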
From here I can feed those links into similar code as above, but I don't know the best way to make sure I don't repeat myself. I figured I'd start collecting all the visited links as I go:
visited_links = [] # new array of what has been visited
main_links.each do |ml|
  np = Nokogiri::HTML(open(page + ml)) # load the next main_link
  visited_links.push(ml) # record the page we're on
  np_links = np.xpath('//a').map{ |l| l['href'] if l['href'] =~ /^\// }.compact.uniq # grab all links on this page pointing to the current domain
  main_links.concat(np_links).uniq! # remove duplicates after pushing?
end
I'm still working out that last part... but does this seem like the right approach?
Thanks.
Others have suggested that you not write your own web crawler. I agree with that if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to make sure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'
def crawl_site( starting_at, &each_page )
  files = %w[png jpeg jpg gif svg txt js css zip gz]
  starting_uri = URI.parse(starting_at)
  seen_pages = Set.new                      # Keep track of what we've seen

  crawl_page = ->(page_uri) do              # A re-usable mini-function
    unless seen_pages.include?(page_uri)
      seen_pages << page_uri                # Record that we've seen this
      begin
        doc = Nokogiri.HTML(open(page_uri)) # Get the page
        each_page.call(doc,page_uri)        # Yield page and URI to the block

        # Find all the links on the page
        hrefs = doc.css('a[href]').map{ |a| a['href'] }

        # Make these URIs, throwing out problem ones like mailto:
        uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

        # Pare it down to only those pages that are on the same site
        uris.select!{ |uri| uri.host == starting_uri.host }

        # Throw out links to files (this could be more efficient with regex)
        uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

        # Remove #foo fragments so that sub-page links aren't differentiated
        uris.each{ |uri| uri.fragment = nil }

        # Recursively crawl the child URIs
        uris.each{ |uri| crawl_page.call(uri) }

      rescue OpenURI::HTTPError # Guard against 404s
        warn "Skipping invalid link #{page_uri}"
      end
    end
  end

  crawl_page.call( starting_uri ) # Kick it all off!
end

crawl_site('http://phrogz.net/') do |page,uri|
  # page here is a Nokogiri HTML document
  # uri is a URI instance with the address of the page
  puts uri
end
In short:
- Keep track of the pages you've seen using a Set. Do this not by the href value, but by the full canonical URI.
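To see why the full URI matters: different href strings can resolve to the same page, and URI instances compare by value, so a Set dedupes them where raw href strings would not (paths here are made up):

```ruby
require 'set'
require 'uri'

seen = Set.new
seen << URI.join('http://example.com/', '/a')     # href "/a" seen on the homepage
seen << URI.join('http://example.com/index', 'a') # href "a" seen on /index
seen.size
# => 1 -- both hrefs resolve to http://example.com/a
```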
- Use URI.join to turn possibly-relative paths into the correct URI relative to the current page.
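A quick illustration of that resolution (the page and hrefs are made up):

```ruby
require 'uri'

page_uri = URI.parse('http://example.com/blog/post.html')
a = URI.join(page_uri, 'next.html').to_s # sibling-relative href
# => "http://example.com/blog/next.html"
b = URI.join(page_uri, '/about').to_s    # root-relative href
# => "http://example.com/about"
```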
- Use recursion to keep crawling every link on every page, but bail out if you've already seen a page.
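The recursion-plus-Set pattern can be seen on its own using a hypothetical in-memory link graph in place of live HTTP:

```ruby
require 'set'

# Hypothetical site: each path maps to the hrefs found on that page
LINKS = {
  '/'  => ['/a', '/b'],
  '/a' => ['/b', '/'],
  '/b' => ['/a'],
}

seen = Set.new
crawl = ->(path) do
  return if seen.include?(path) # bail out on pages we've already seen
  seen << path
  LINKS.fetch(path, []).each { |p| crawl.call(p) }
end
crawl.call('/')
seen.sort
# => ["/", "/a", "/b"] -- every page visited exactly once despite the cycles
```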