Notes on Using WebMagic, a Java Crawler Framework

2023-11-17

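Recently I worked on a company news site with PC and mobile (H5) front ends. The data is crawled from two sites, HSZX and huanqiu, mainly with WebMagic, a crawler framework written in Java. Crawling is split into a bulk crawl, which fetches all existing historical data, and an incremental crawl, which runs on a timer every 10 minutes. Because there are two source sites with many channels, the data volume is large and updates are frequent. I ran into quite a few pitfalls during development, so I am taking the time today to write them up.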

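Tools:

1. WebMagic is a simple, flexible crawler framework. With it you can quickly build a crawler that is efficient and easy to maintain.

Website: http://webmagic.io/

Documentation: http://webmagic.io/docs/zh/

2. jsoup is an HTML parsing library for Java with good parsing performance.

Documentation: http://www.open-open.com/jsoup/

3. Jdiy is a very lightweight rapid-development framework for Java. It works in both Java EE and Java SE environments, offers a convenient API for database CRUD operations, and supports the major mainstream databases.

Website: http://www.jdiy.org/jdiy.jd

I. Technologies used
WebMagic as the crawler framework, HttpClient to fetch pages, Jsoup to parse pages and locate the content to extract, an ExecutorService thread pool for the scheduled incremental crawl, and Jdiy as the persistence layer.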

II. Historical crawl code:

package com.spider.huanqiu.history;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
import org.jdiy.core.Rs;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import com.spider.huasheng.history.Pindao;
import com.spider.utils.Config;
import com.spider.utils.ConfigBase;
import com.spider.utils.DateUtil;
import com.spider.utils.HttpClientUtil;
import com.spider.utils.service.CommService;
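/**
* Description: crawls historical data for the xxx international channel
* Created: 2016-11-9
* @author Jibaole
*/
public class HQNewsDao extends ConfigBase implements PageProcessor{
public static final String index_list = "(.*).huanqiu.com/(.*)pindao=(.*)";// regex that identifies list-page URLs
public static String pic_dir = fun.getProValue(PINDAO_PIC_FILE_PATH);// directory where downloaded images are saved
// Part 1: site configuration for the crawl: encoding, retry count, crawl interval, timeout, request headers, User-Agent, etc.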
private Site site = Site.me().setRetryTimes(3).setSleepTime(1000).setTimeOut(6000)
.addHeader("Accept-Encoding", "/").setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36");
@Override
public Site getSite() {
return site;
}
@Override
public void process(Page page) {
try {
// list page
if (page.getUrl().regex(index_list).match()) {
List<String> Urllist =new ArrayList<String>();
String url =page.getUrl().toString();
String pageUrl = url.substring(0,url.lastIndexOf("?"));
String pindaoId =url.substring(url.lastIndexOf("=")+1);
Urllist = saveNewsListData(pageUrl,pindaoId);
page.addTargetRequests(Urllist);// queue the follow-up URLs; each is processed according to its URL pattern
}
// add else-if branches here to handle other URL patterns
} catch (Exception e) {
e.printStackTrace();
}
}
private List<String> saveNewsListData(String pageUrl,String pindaoId) {
List<String> urlList = new ArrayList<String>();
Document docList = null;
String newsIdFirst="";
String pageListStr=HttpClientUtil.getPage(pageUrl);// fetch the page content with HttpClientUtil
if(StringUtils.isNotEmpty(pageListStr)){
try {
docList = Jsoup.parse(pageListStr);
Elements fallsFlow=docList.getElementsByClass("fallsFlow");
if(!fallsFlow.isEmpty()){
Elements liTag=fallsFlow.get(0).getElementsByTag("li");
if(!liTag.isEmpty()){
for(int i=0;i<liTag.size();i++){
String title="",contentUrl="",newsId="",pic="",absContent="",pushTime="",timeFalg="";
Element obj=liTag.get(i);
try{
contentUrl=obj.getElementsByTag("h3").select("a").attr("href");
if(StringUtils.isNotEmpty(contentUrl)){
title=obj.getElementsByTag("h3").select("a").attr("title");// title
Rs isTitle = CommService.checkNewsName(title); // look up the title; the article is skipped if it is already stored
if(!isTitle.isNull()){
continue;
}
System.err.println("<<<<<<--DAO------当前抓取文章为(xxx历史):"+title+"------------");
newsId = contentUrl.substring(contentUrl.lastIndexOf("/") + 1,contentUrl.lastIndexOf(".html"));
if(!pageUrl.contains(".htm") && i == 0){
newsIdFirst = newsId;
}
// image
if(!obj.getElementsByTag("img").attr("src").isEmpty()){
pic=obj.getElementsByTag("img").first().attr("src");
if(StringUtils.isNotEmpty(pic) ){
pic = fun.downloadPic(pic,pic_dir+"list/"+newsId+"/");// download the list image and save it locally
}
}
if(!obj.getElementsByTag("h5").isEmpty()){
// summary
absContent = obj.getElementsByTag("h5").first().text();
if(StringUtils.isNotEmpty(absContent) && absContent.indexOf("[")>0){
absContent = absContent.substring(0, absContent.indexOf("["));
}
}
if(!obj.getElementsByTag("h6").isEmpty()){
pushTime = obj.getElementsByTag("h6").text();
timeFalg=pushTime.substring(0, 4);
}
String hrmlStr=HttpClientUtil.getPage(contentUrl);
if(StringUtils.isNotEmpty(hrmlStr)){
Document docPage = Jsoup.parse(hrmlStr);
Elements pageContent = docPage.getElementsByClass("conText");
if(!pageContent.isEmpty()){
String comefrom = pageContent.get(0).getElementsByClass("fromSummary").text();// source
if(StringUtils.isNotEmpty(comefrom) && comefrom.contains("环球")){
String author=pageContent.get(0).getElementsByClass("author").text();// author
Element contentDom = pageContent.get(0).getElementById("text");
if(!contentDom.getElementsByTag("a").isEmpty()){
contentDom.getElementsByTag("a").removeAttr("href");// remove outbound links
}
if(!contentDom.getElementsByClass("reTopics").isEmpty()){
contentDom.getElementsByClass("reTopics").remove();// remove the recommendation block
}
if(!contentDom.getElementsByClass("spTopic").isEmpty()){
contentDom.getElementsByClass("spTopic").remove(); // remove the ranking list
}
if(!contentDom.getElementsByClass("editorSign").isEmpty()){
contentDom.getElementsByClass("editorSign").remove();// remove the editor signature
}
String content = contentDom.toString();
if(!StringUtils.isEmpty(content)){
content = content.replaceAll("\r\n|\r|\n|\t|\b|~|\f", "");// strip line breaks and control characters
content = replaceForNews(content,pic_dir+"article/"+newsId+"/");// replace image URLs in the content with local ones
while (true) {
if(content.indexOf("<!--") ==-1 && content.indexOf("<script") == -1){
break;
}else{
if(content.indexOf("<!--")>0){
String moveContent= content.substring(content.indexOf("<!--"), content.indexOf("-->")+3);// strip the HTML comment
content = content.replace(moveContent, "");
}
if(content.indexOf("<script") >0 && content.lastIndexOf("</script>")>0){
String moveContent= content.substring(content.indexOf("<script"), content.indexOf("</script>")+9);// strip the JS block
content = content.replace(moveContent, "");
}
}
}
}
if(StringUtils.isEmpty(timeFalg) || "2016".equals(timeFalg) ||
"28".equals(pindaoId) || "29".equals(pindaoId) || "30".equals(pindaoId)){
Rs news= new Rs("News");
news.set("title", title);
news.set("shortTitle",title);
news.set("beizhu",absContent);
news.set("savetime", pushTime);
if(StringUtils.isNotEmpty(pic)){
news.set("path", pic);
news.set("mini_image", pic);
}
news.set("pindaoId", pindaoId);
news.set("status", 0);// not displayed
news.set("canComment", 1);// whether comments are allowed
news.set("syn", 1);// whether asynchronous
news.set("type", 1);// type
news.set("comefrom",comefrom);
news.set("author", author);
news.set("content", content);
news.set("content2", content);
CommService.save(news);
System.err.println("------新增(xxx历史):"+title+"------>>>>>>>");
}else{
break;
}
}
}
}
}
}catch (Exception e) {
e.printStackTrace();
}
}
}
if(!pageUrl.contains(".htm")){
// read the pagination block
Element pages = docList.getElementById("pages");
int num = pages.getElementsByTag("a").size();
String pageMaxStr = pages.getElementsByTag("a").get(num-2).text();
int pageMax=0;
if(StringUtils.isNotEmpty(pageMaxStr)){
pageMax= Integer.parseInt(pageMaxStr);
}
if(pageMax>historyMaxPage){// cap the number of historical pages to crawl
pageMax = historyMaxPage;
}
for(int i=1 ;i<pageMax;i++){// build the requests for the following pages
String link = "";
link = pageUrl+(i+1)+".html?pindao="+pindaoId;
urlList.add(link);// collect the URL of each further page
}
// look up the incremental marker
Rs flag = CommService.checkPd(pindaoId,pageUrl,Config.SITE_HQ);
// first run: create the marker
if(flag.isNull()){
Rs task= new Rs("TaskInfo");
task.set("pindao_id", pindaoId);
task.set("news_id", newsIdFirst);
task.set("page_url", pageUrl);
task.set("site", Config.SITE_HQ);
task.set("create_time", DateUtil.fullDate());
CommService.save(task);
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
return urlList;
}
public static void main(String[] args) {
List<String> strList=new ArrayList<String>();
strList.add("http://www.xxx/exclusive/?pindao="+Pindao.getKey("国际"));
// rolling news
strList.add("http://www.xxx/article/?pindao="+Pindao.getKey("国际"));
for(String str:strList){
Spider.create(new HQNewsDao()).addUrl(str).thread(1).run();
}
}
// entry point used for all channels
public static void runNewsList(List<String> strList){
for(String str:strList){
Spider.create(new HQNewsDao()).addUrl(str).thread(1).run(); // add the start URL and set the thread count
}
}
}

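III. Incremental crawl code (adapted from the historical version):

Note: the incremental crawl runs every 10 minutes and fetches only the newest page each time. It relies on an incremental marker (the news_id of the first article seen in the previous run) and stops as soon as it meets that news_id or reaches the end of the page.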
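The stopping rule can be looked at in isolation from the page parsing. The following is only a minimal sketch under my own assumptions: `IncrementalCursor`, `newIdsSince` and `nextMarker` are hypothetical names that do not exist in the project, and the IDs are assumed to arrive newest-first, as they do on the list page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original project.
class IncrementalCursor {

    /**
     * Returns the IDs that are newer than lastSeenId.
     * Assumes idsNewestFirst is ordered newest-first, like the list page.
     * If the marker is not on the page, the whole page counts as new.
     */
    static List<String> newIdsSince(List<String> idsNewestFirst, String lastSeenId) {
        List<String> fresh = new ArrayList<>();
        for (String id : idsNewestFirst) {
            if (id.equals(lastSeenId)) {
                break; // reached the marker stored by the previous run
            }
            fresh.add(id);
        }
        return fresh;
    }

    /** Marker to store for the next run: the first ID on the page, or the old marker if the page is empty. */
    static String nextMarker(List<String> idsNewestFirst, String lastSeenId) {
        return idsNewestFirst.isEmpty() ? lastSeenId : idsNewestFirst.get(0);
    }
}
```

Keeping the rule in a pure function like this makes it easy to check the edge cases (marker missing, empty page) without touching the network.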

package com.spider.huanqiu.task;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
import org.jdiy.core.Rs;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import com.spider.huasheng.history.Pindao;
import com.spider.utils.Config;
import com.spider.utils.ConfigBase;
import com.spider.utils.DateUtil;
import com.spider.utils.HttpClientUtil;
import com.spider.utils.service.CommService;
public class HQNewsTaskDao extends ConfigBase implements PageProcessor{
public static final String index_list = "(.*).huanqiu.com/(.*)pindao=(.*)";
public static String pic_dir = fun.getProValue(PINDAO_PIC_FILE_PATH);
public static String new_id="";
// Part 1: site configuration for the crawl: encoding, crawl interval, retry count, etc.
private Site site = Site.me().setRetryTimes(3).setSleepTime(1000).setTimeOut(6000)
.addHeader("Accept-Encoding", "/").setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36");
@Override
public Site getSite() {
return site;
}
@Override
public void process(Page page) {
try {
// list page
if (page.getUrl().regex(index_list).match()) {
List<String> Urllist =new ArrayList<String>();
String url =page.getUrl().toString();
String pageUrl = url.substring(0,url.lastIndexOf("?"));
String pindaoId =url.substring(url.lastIndexOf("=")+1);
Rs isFlag = CommService.checkPd(pindaoId,pageUrl,Config.SITE_HQ);
if(!isFlag.isNull()){
new_id=isFlag.getString("news_id");
}
Urllist = saveNewsListData(pageUrl,pindaoId);
page.addTargetRequests(Urllist);
}
} catch (Exception e) {
e.printStackTrace();
}
}
private List<String> saveNewsListData(String pageUrl,String pindaoId) {
List<String> urlList = new ArrayList<String>();
Document docList = null;
String pageListStr=HttpClientUtil.getPage(pageUrl);
if(StringUtils.isNotEmpty(pageListStr)){
try {
docList = Jsoup.parse(pageListStr);
Elements fallsFlow=docList.getElementsByClass("fallsFlow");
if(!fallsFlow.isEmpty()){
String newsIdFirst="";
Boolean isIng = true;
Elements liTag=fallsFlow.get(0).getElementsByTag("li");
if(!liTag.isEmpty()){
for(int i=0;i<liTag.size() && isIng;i++){
String title="",contentUrl="",newsId="",pic="",absContent="",pushTime="";
Element obj=liTag.get(i);
try{
contentUrl=obj.getElementsByTag("h3").select("a").attr("href");
if(StringUtils.isNotEmpty(contentUrl)){
title=obj.getElementsByTag("h3").select("a").attr("title");// title
Rs isTitle = CommService.checkNewsName(title); // look up the title; the article is skipped if it is already stored
if(!isTitle.isNull()){
continue;
}
System.err.println("---------当前抓取文章为(增量):"+title+"------------");
newsId = contentUrl.substring(contentUrl.lastIndexOf("/") + 1,contentUrl.lastIndexOf(".html"));
if(!newsId.equals(new_id)){
if(!pageUrl.contains(".htm") && i == 0){
newsIdFirst = newsId;
}
// image
if(!obj.getElementsByTag("img").attr("src").isEmpty()){
pic=obj.getElementsByTag("img").first().attr("src");
if(StringUtils.isNotEmpty(pic) ){
pic = fun.downloadPic(pic,pic_dir+"list/"+newsId+"/");
}
}
if(!obj.getElementsByTag("h5").isEmpty()){
// summary
absContent = obj.getElementsByTag("h5").first().text();
if(StringUtils.isNotEmpty(absContent) && absContent.indexOf("[")>0){
absContent = absContent.substring(0, absContent.indexOf("["));
}
}
if(!obj.getElementsByTag("h6").isEmpty()){
pushTime = obj.getElementsByTag("h6").text();
}
String hrmlStr=HttpClientUtil.getPage(contentUrl);
if(StringUtils.isNotEmpty(hrmlStr)){
Document docPage = Jsoup.parse(hrmlStr);
Elements pageContent = docPage.getElementsByClass("conText");
if(!pageContent.isEmpty()){
String comefrom = pageContent.get(0).getElementsByClass("fromSummary").text();// source
if(StringUtils.isNotEmpty(comefrom) && comefrom.contains("环球")){
String author=pageContent.get(0).getElementsByClass("author").text();// author
Element contentDom = pageContent.get(0).getElementById("text");
if(!contentDom.getElementsByTag("a").isEmpty()){
contentDom.getElementsByTag("a").removeAttr("href");// remove outbound links
}
if(!contentDom.getElementsByClass("reTopics").isEmpty()){
contentDom.getElementsByClass("reTopics").remove();// remove the recommendation block
}
if(!contentDom.getElementsByClass("spTopic").isEmpty()){
contentDom.getElementsByClass("spTopic").remove();
}
if(!contentDom.getElementsByClass("editorSign").isEmpty()){
contentDom.getElementsByClass("editorSign").remove();// remove the editor signature
}
String content = contentDom.toString();
if(!StringUtils.isEmpty(content)){
content = content.replaceAll("\r\n|\r|\n|\t|\b|~|\f", "");// strip line breaks and control characters
content = replaceForNews(content,pic_dir+"article/"+newsId+"/");// replace image URLs in the content with local ones
while (true) {
if(content.indexOf("<!--") ==-1 && content.indexOf("<script") == -1){
break;
}else{
if(content.indexOf("<!--")>0){
String moveContent= content.substring(content.indexOf("<!--"), content.indexOf("-->")+3);// strip the HTML comment
content = content.replace(moveContent, "");
}
if(content.indexOf("<script") >0 && content.lastIndexOf("</script>")>0){
String moveContent= content.substring(content.indexOf("<script"), content.indexOf("</script>")+9);// strip the JS block
content = content.replace(moveContent, "");
}
}
}
}
if(StringUtils.isNotEmpty(content) && StringUtils.isNotEmpty(title)){
Rs news= new Rs("News");
news.set("title", title);
news.set("shortTitle",title);
news.set("beizhu",absContent);
news.set("savetime", pushTime);
if(StringUtils.isNotEmpty(pic)){
news.set("path", pic);
news.set("mini_image", pic);
}
news.set("pindaoId", pindaoId);
news.set("status", 1);// display status
news.set("canComment", 1);// whether comments are allowed
news.set("syn", 1);// whether asynchronous
news.set("type", 1);// type
news.set("comefrom",comefrom);
news.set("author", author);
news.set("content", content);
news.set("content2", content);
CommService.save(news);
}
}
}
}
}else{
isIng=false;
break;
}
}
}catch (Exception e) {
e.printStackTrace();
}
}
}
if(!pageUrl.contains(".htm")){
// incremental marker
Rs flag = CommService.checkPd(pindaoId,pageUrl,Config.SITE_HQ);
// first run: create the marker
if(flag.isNull()){
Rs task= new Rs("TaskInfo");
task.set("pindao_id", pindaoId);
task.set("news_id", newsIdFirst);
task.set("page_url", pageUrl);
task.set("site", Config.SITE_HQ);
task.set("create_time", DateUtil.fullDate());
CommService.save(task);
}else if(StringUtils.isNotEmpty(newsIdFirst)){
flag.set("news_id", newsIdFirst);
flag.set("update_time", DateUtil.fullDate());
CommService.save(flag);
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
return urlList;
}
public static void main(String[] args) {
List<String> strList=new ArrayList<String>();
strList.add("http://www.xxx/exclusive/?pindao="+Pindao.getKey("国际"));
// rolling news
strList.add("http://www.xxx/article/?pindao="+Pindao.getKey("国际"));
for(String str:strList){
Spider.create(new HQNewsTaskDao()).addUrl(str).thread(1).run();
}
}
// entry point used for all channels
public static void runNewsList(List<String> strList){
for(String str:strList){
Spider.create(new HQNewsTaskDao()).addUrl(str).thread(1).run();
}
}
}

IV. Scheduled crawling, configured as follows:
1. Register the listener in web.xml

<listener>
<listener-class>com.spider.utils.AutoRun</listener-class>
</listener>

2. Scheduling code

package com.spider.utils;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import com.spider.huanqiu.timer.HQJob1;
import com.spider.huanqiu.timer.HQJob2;
import com.spider.huanqiu.timer.HQJob3;
import com.spider.huanqiu.timer.HQJob4;
import com.spider.huasheng.timer.HSJob1;
import com.spider.huasheng.timer.HSJob2;
/**
* Description: listener that starts the incremental crawl jobs
* Created: 2016-11-4
* @author Jibaole
*/
public class AutoRun implements ServletContextListener {
public void contextInitialized(ServletContextEvent event) {
ScheduledExecutorService scheduExec = Executors.newScheduledThreadPool(6);
/*
* Start the periodic jobs here.
* scheduleAtFixedRate(command, initialDelay, period, unit) takes four arguments:
* command: the task to run; initialDelay: delay before the first run; period: interval between runs; unit: time unit of both values (milliseconds here)
*/
scheduExec.scheduleAtFixedRate(new HSJob1(), 1*1000*60,1000*60*10,TimeUnit.MILLISECONDS); // 1-minute delay, then every 10 minutes
scheduExec.scheduleAtFixedRate(new HSJob2(), 3*1000*60,1000*60*10,TimeUnit.MILLISECONDS); // 3-minute delay, then every 10 minutes
scheduExec.scheduleAtFixedRate(new HQJob1(), 5*1000*60,1000*60*10,TimeUnit.MILLISECONDS); // 5-minute delay, then every 10 minutes
scheduExec.scheduleAtFixedRate(new HQJob2(), 7*1000*60,1000*60*10,TimeUnit.MILLISECONDS); // 7-minute delay, then every 10 minutes
scheduExec.scheduleAtFixedRate(new HQJob3(), 9*1000*60,1000*60*14,TimeUnit.MILLISECONDS); // 9-minute delay, then every 14 minutes
scheduExec.scheduleAtFixedRate(new HQJob4(), 11*1000*60,1000*60*10,TimeUnit.MILLISECONDS); // 11-minute delay, then every 10 minutes
}
public void contextDestroyed(ServletContextEvent event) {
System.out.println("=======timer销毁==========");
//timer.cancel();
}
}
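The listener above creates the pool as a local variable and leaves contextDestroyed empty, so nothing ever stops the worker threads when the web application is undeployed. Below is a minimal sketch of one way to keep a handle on the pool and shut it down. `CrawlScheduler` and its methods are hypothetical names of mine, not the project's; in a real listener, `start` would be called from contextInitialized and `stop` from contextDestroyed.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper for illustration; not part of the original project.
class CrawlScheduler {
    private final ScheduledExecutorService pool;

    CrawlScheduler(int threads) {
        this.pool = Executors.newScheduledThreadPool(threads);
    }

    /** Schedule a job with its own initial delay so that jobs start staggered. */
    void start(Runnable job, long initialDelayMs, long periodMs) {
        pool.scheduleAtFixedRate(job, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
    }

    /**
     * Cancel pending runs, interrupt running ones, and wait up to waitMs.
     * Returns true if all worker threads finished within the wait.
     */
    boolean stop(long waitMs) {
        pool.shutdownNow();
        try {
            return pool.awaitTermination(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

Holding the pool in a field rather than a local variable is the only structural change this needs; the scheduling calls themselves stay as they are.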

3. The job itself (one example)

package com.spider.huasheng.timer;
import java.util.ArrayList;
import java.util.List;
import java.util.TimerTask;
import com.spider.huasheng.task.HSTaskDao;
import com.spider.huasheng.task.HSTaskDao1;
import com.spider.huasheng.task.HSTaskDao2;
/**
* Description: scheduled job for the international, society, domestic, commentary and other channels
* Created: 2016-11-9
* @author Jibaole
*/
public class HSJob1 implements Runnable{
@Override
public void run() {
System.out.println("======>>>开始：xxx-任务1====");
try {
runNews();
runNews1();
runNews2();
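} catch (Throwable t) {
// Catch everything so that one failure does not cancel the schedule, but keep the cause visible
System.out.println("Error");
t.printStackTrace();
}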
System.out.println("======xxx-任务1>>>结束！！！====");
}
/**
* Crawl the news channel lists
*/
public void runNews(){
List<String> strList=new ArrayList<String>();
/**##############>>>16. International<<<##################*/
// international outlook
strList.add("http://xxx/class/2199.html?pindao=国际");
/**##############>>>17. Society<<<##################*/
// society
strList.add("http://xxx/class/2200.html?pindao=社会");
/**##############>>>18. Domestic<<<##################*/
// domestic news
strList.add("http://xxx/class/1922.html?pindao=国内");
HSTaskDao.runNewsList(strList);
}
/**
* Crawl the news channel lists
*/
public void runNews1(){
List<String> strList=new ArrayList<String>();
/**##############>>>19. Commentary<<<##################*/
// Huasheng viewpoints
strList.add("http://xxx/class/709.html?pindao=评论");
// financial observer
strList.add("http://xxx/class/2557.html?pindao=评论");
/**##############>>>20. Military<<<##################*/
// military
strList.add("http://xxx/class/2201.html?pindao=军事");
HSTaskDao1.runNewsList(strList);
}
/**
* Crawl the news channel lists
*/
public void runNews2(){
List<String> strList=new ArrayList<String>();
/**##############>>>24. Finance<<<##################*/
// financial news
strList.add("http://xxx/class/2353.html?pindao=财经");
// economic observer
strList.add("http://xxx/class/2348.html?pindao=财经");
/**##############>>>30. Culture<<<##################*/
// today in history
strList.add("http://xxx/class/1313.html?pindao=人文");
// official history
strList.add("http://xxx/class/1362.html?pindao=人文");
HSTaskDao2.runNewsList(strList);
}
}

V. Utility classes used

1. The HttpClientUtil utility class

package com.spider.utils;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.conn.ssl.DefaultHostnameVerifier;
import org.apache.http.conn.util.PublicSuffixMatcher;
import org.apache.http.conn.util.PublicSuffixMatcherLoader;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.entity.mime.content.FileBody;
import org.apache.http.entity.mime.content.StringBody;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
public class HttpClientUtil {
private final static String charset = "UTF-8";
private RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(15000)
.setConnectTimeout(15000)
.setConnectionRequestTimeout(15000)
.build();
private static HttpClientUtil instance = null;
private HttpClientUtil(){}
public static HttpClientUtil getInstance(){
if (instance == null) {
instance = new HttpClientUtil();
}
return instance;
}
/**
* Send a POST request
* @param httpUrl target URL
*/
public String sendHttpPost(String httpUrl) {
HttpPost httpPost = new HttpPost(httpUrl);// create the HttpPost
return sendHttpPost(httpPost);
}
/**
* Send a POST request
* @param httpUrl target URL
* @param params parameters (format: key1=value1&key2=value2)
*/
public String sendHttpPost(String httpUrl, String params) {
HttpPost httpPost = new HttpPost(httpUrl);// create the HttpPost
try {
// set the parameters
StringEntity stringEntity = new StringEntity(params, "UTF-8");
stringEntity.setContentType("application/x-www-form-urlencoded");
httpPost.setEntity(stringEntity);
} catch (Exception e) {
e.printStackTrace();
}
return sendHttpPost(httpPost);
}
/**
* Send a POST request
* @param httpUrl target URL
* @param maps parameters
*/
public String sendHttpPost(String httpUrl, Map<String, String> maps) {
HttpPost httpPost = new HttpPost(httpUrl);// create the HttpPost
httpPost.setHeader("Content-Type","application/x-www-form-urlencoded;charset="+charset);
httpPost.setHeader("User-Agent","Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.");
// build the parameter list
List<NameValuePair> nameValuePairs = new ArrayList<NameValuePair>();
for (String key : maps.keySet()) {
nameValuePairs.add(new BasicNameValuePair(key, maps.get(key)));
}
try {
httpPost.setEntity(new UrlEncodedFormEntity(nameValuePairs, "UTF-8"));
} catch (Exception e) {
e.printStackTrace();
}
return sendHttpPost(httpPost);
}
/**
* Send a POST request (with files)
* @param httpUrl target URL
* @param maps parameters
* @param fileLists attachments
*/
public String sendHttpPost(String httpUrl, Map<String, String> maps, List<File> fileLists) {
HttpPost httpPost = new HttpPost(httpUrl);// create the HttpPost
MultipartEntityBuilder meBuilder = MultipartEntityBuilder.create();
for (String key : maps.keySet()) {
meBuilder.addPart(key, new StringBody(maps.get(key), ContentType.TEXT_PLAIN));
}
for(File file : fileLists) {
FileBody fileBody = new FileBody(file);
meBuilder.addPart("files", fileBody);
}
HttpEntity reqEntity = meBuilder.build();
httpPost.setEntity(reqEntity);
return sendHttpPost(httpPost);
}
/**
* Send the POST request
* @param httpPost the prepared request
* @return response body, or null if the request failed
*/
private String sendHttpPost(HttpPost httpPost) {
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;
HttpEntity entity = null;
String responseContent = null;
try {
// create a default httpClient instance
httpClient = HttpClients.createDefault();
httpPost.setConfig(requestConfig);
// execute the request
response = httpClient.execute(httpPost);
entity = response.getEntity();
responseContent = EntityUtils.toString(entity, "UTF-8");
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
// close the connection and release resources
if (response != null) {
response.close();
}
if (httpClient != null) {
httpClient.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
return responseContent;
}
/**
* Send a GET request
* @param httpUrl target URL
*/
public String sendHttpGet(String httpUrl) {
HttpGet httpGet = new HttpGet(httpUrl);// create the GET request
return sendHttpGet(httpGet);
}
/**
* Send a GET request over HTTPS
* @param httpUrl target URL
*/
public String sendHttpsGet(String httpUrl) {
HttpGet httpGet = new HttpGet(httpUrl);// create the GET request
return sendHttpsGet(httpGet);
}
/**
* Send the GET request
* @param httpGet the prepared request
* @return response body, or null if the request failed
*/
private String sendHttpGet(HttpGet httpGet) {
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;
HttpEntity entity = null;
String responseContent = null;
try {
// create a default httpClient instance
httpClient = HttpClients.createDefault();
httpGet.setConfig(requestConfig);
// execute the request
response = httpClient.execute(httpGet);
entity = response.getEntity();
responseContent = EntityUtils.toString(entity, "UTF-8");
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
// close the connection and release resources
if (response != null) {
response.close();
}
if (httpClient != null) {
httpClient.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
return responseContent;
}
/**
* Send the GET request over HTTPS
* @param httpGet the prepared request
* @return response body, or null if the request failed
*/
private String sendHttpsGet(HttpGet httpGet) {
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;
HttpEntity entity = null;
String responseContent = null;
try {
// create an httpClient instance with a custom hostname verifier
PublicSuffixMatcher publicSuffixMatcher = PublicSuffixMatcherLoader.load(new URL(httpGet.getURI().toString()));
DefaultHostnameVerifier hostnameVerifier = new DefaultHostnameVerifier(publicSuffixMatcher);
httpClient = HttpClients.custom().setSSLHostnameVerifier(hostnameVerifier).build();
httpGet.setConfig(requestConfig);
// execute the request
response = httpClient.execute(httpGet);
entity = response.getEntity();
responseContent = EntityUtils.toString(entity, "UTF-8");
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
// close the connection and release resources
if (response != null) {
response.close();
}
if (httpClient != null) {
httpClient.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
return responseContent;
}
/**
* Fetch a page with (commons) HttpClient
* @param url page URL
* @return page content, or an empty string if the page could not be fetched
*/
public static String getPage(String url){
String result="";
HttpClient httpClient = new HttpClient();
// set the timeouts before the request runs; set afterwards they have no effect on it
httpClient.setTimeout(5000);
httpClient.setConnectionTimeout(5000);
GetMethod getMethod = new GetMethod(url+"?date=" + new Date().getTime());// append a timestamp so the page is not served from cache
try {
int statusCode = httpClient.executeMethod(getMethod);
if (statusCode != HttpStatus.SC_OK) {
System.err.println("Method failed: "+ getMethod.getStatusLine());
}
// read the content
//byte[] responseBody = getMethod.getResponseBody();
BufferedReader reader = new BufferedReader(new InputStreamReader(getMethod.getResponseBodyAsStream()));
StringBuffer stringBuffer = new StringBuffer();
String str = "";
while((str = reader.readLine())!=null){
stringBuffer.append(str);
}
// build the result
result = stringBuffer.toString();
} catch (Exception e) {
System.err.println("页面无法访问");
} finally {
// always release the connection, even when the request fails
getMethod.releaseConnection();
}
return result;
}
}

2. Image download method

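/**
* Download an image to local storage
* @param picUrl image URL
* @param localPath local directory where the image is saved
* @return the image's new address, or null if the download failed
*/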
public String downloadPic(String picUrl,String localPath){
String filePath = null;
String url = null;
try {
URL httpurl = new URL(picUrl);
String fileName = getFileNameFromUrl(picUrl);
filePath = localPath + fileName;
File f = new File(filePath);
FileUtils.copyURLToFile(httpurl, f);
Function fun = new Function();
url = filePath.replace("/www/web/imgs", fun.getProValue("IMG_PATH"));
} catch (Exception e) {
logger.info(e);
return null;
}
return url;
}

3. Method for replacing images in news content

/**
* Replace image URLs in the content with local addresses
* @param content HTML content
* @param pic_dir local directory for the images
* @return HTML content
*/
public static String replaceForNews(String content,String pic_dir){
String str = content;
String cont = content;
while (true) {
int i = str.indexOf("src=\"");
if (i != -1) {
str = str.substring(i+5, str.length());
int j = str.indexOf("\"");
String pic_url = str.substring(0, j);
// download the image locally and get back its new address
String pic_path = fun.downloadPicForNews(pic_url,pic_dir);
if(StringUtils.isNotEmpty(pic_url) && StringUtils.isNotEmpty(pic_path)){
cont = cont.replace(pic_url, pic_path);
str = str.substring(j,str.length());
}
} else{
break;
}
}
return cont;
}

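/**
* Download an image to local storage
* @param picUrl image URL
* @param localPath local directory where the image is saved
* @return the image's new address, an empty string if the server did not answer 200, or null on error
*/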
public String downloadPicForNews(String picUrl,String localPath){
String filePath = "";
String url = "";
try {
URL httpurl = new URL(picUrl);
HttpURLConnection urlcon = (HttpURLConnection) httpurl.openConnection();
urlcon.setReadTimeout(3000);
urlcon.setConnectTimeout(3000);
int state = urlcon.getResponseCode(); // HTTP status of the image request
if(state == 200){
String fileName = getFileNameFromUrl(picUrl);
filePath = localPath + fileName;
File f = new File(filePath);
FileUtils.copyURLToFile(httpurl, f);
Function fun = new Function();
url = filePath.replace("/www/web/imgs", fun.getProValue("IMG_PATH"));
}
} catch (Exception e) {
logger.info(e);
return null;
}
return url;
}

Building the file name, generated from a timestamp:

/**
* Build a file name for the given URL
* @param url source URL, used only for its extension
* @return file name
*/
public static String getFileNameFromUrl(String url){
// get the file extension
String sux = url.substring(url.lastIndexOf("."));
if(sux.length() > 4){
sux = ".jpg";
}
int i = (int)(Math.random()*1000);
// file name built from the current timestamp plus a random number
String name = new Long(System.currentTimeMillis()).toString()+ i + sux;
return name;
}

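VI. Pitfalls encountered
1. The incremental crawl frequently hit the following two exceptions
Fetch timeouts: fetching pages with Jsoup was replaced by fetching with HttpClient, leaving only the parsing to Jsoup.

Page gzip exception (this one was especially nasty: it caused serious gaps in both the historical and incremental data, and production kept misbehaving).

Solution:

Add .addHeader("Accept-Encoding", "/") to the Site configuration.

This is a small bug in the WebMagic framework source: if the header is not set, Accept-Encoding defaults to gzip.

2. Scheduled crawling
Tasks are now run in parallel on multiple threads by a ScheduledExecutorService, replacing Timer, which runs its tasks serially on a single thread.

The original code was: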

package com.spider.utils;
import java.util.Timer;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import com.spider.huanqiu.timer.HQJob1;
import com.spider.huanqiu.timer.HQJob2;
import com.spider.huanqiu.timer.HQJob3;
import com.spider.huanqiu.timer.HQJob4;
import com.spider.huasheng.timer.HSJob1;
import com.spider.huasheng.timer.HSJob2;
/**
* Description: listener that starts the incremental crawl jobs
* Created: 2016-11-4
* @author Jibaole
*/
public class AutoRun implements ServletContextListener {
//HS-job
private Timer hsTimer1 = null;
private Timer hsTimer2 = null;
//HQZX-job
private Timer hqTimer1 = null;
private Timer hqTimer2 = null;
private Timer hqTimer3 = null;
private Timer hqTimer4 = null;
public void contextInitialized(ServletContextEvent event) {
hsTimer1 = new Timer(true);
hsTimer2 = new Timer(true);
hqTimer1 = new Timer(true);
hqTimer2 = new Timer(true);
hqTimer3 = new Timer(true);
hqTimer4 = new Timer(true);
/*
* Start the periodic jobs here.
* Timer.scheduleAtFixedRate(task, delay, period) takes three arguments:
* task: the task to run; delay: delay before the first run, in milliseconds; period: interval between runs, in milliseconds
*/
hsTimer1.scheduleAtFixedRate(new HSJob1(), 1*1000*60,1000*60*10); // 1-minute delay, then every 10 minutes
hsTimer2.scheduleAtFixedRate(new HSJob2(), 3*1000*60,1000*60*10); // 3-minute delay, then every 10 minutes
hqTimer1.scheduleAtFixedRate(new HQJob1(), 5*1000*60,1000*60*10); // 5-minute delay, then every 10 minutes
hqTimer2.scheduleAtFixedRate(new HQJob2(), 7*1000*60,1000*60*10); // 7-minute delay, then every 10 minutes
hqTimer3.scheduleAtFixedRate(new HQJob3(), 9*1000*60,1000*60*10); // 9-minute delay, then every 10 minutes
hqTimer4.scheduleAtFixedRate(new HQJob4(), 11*1000*60,1000*60*10); // 11-minute delay, then every 10 minutes
}
public void contextDestroyed(ServletContextEvent event) {
System.out.println("=======timer销毁==========");
//timer.cancel();
}
}

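3. When several jobs are scheduled on multiple threads, an exception thrown by one job terminates that job

Solution: wrap the body of the job's run() method in try/catch.

4. When HttpClient fetches pages on a schedule, pages come back from cache and the latest content is missed

Solution: in the utility class, append a timestamp to the requested URL: url+"?date=" + new Date().getTime()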
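The reason this matters is that scheduleAtFixedRate suppresses all subsequent executions of a task once one execution throws. Rather than repeating the try/catch in every job, the guard can live in a small wrapper. This is only a sketch: `SafeJob` is a hypothetical name of mine, not a class from the project.

```java
// Hypothetical wrapper for illustration; not part of the original project.
class SafeJob implements Runnable {
    private final Runnable delegate;

    SafeJob(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (Throwable t) {
            // Swallow the failure so the scheduler keeps running this job,
            // but print the cause so the problem is not hidden.
            t.printStackTrace();
        }
    }
}
```

With this in place a job would be scheduled as `new SafeJob(new HSJob1())`, and the job classes themselves need no try/catch of their own.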
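One caveat with appending "?date=" unconditionally is that it produces a malformed query when the URL already has one. A minimal sketch of a helper that picks the right separator follows; `CacheBuster` and `withTimestamp` are hypothetical names, and URLs containing a `#fragment` are assumed not to occur.

```java
// Hypothetical helper for illustration; not part of the original project.
class CacheBuster {

    /**
     * Appends a date=<millis> parameter, using '&' when the URL already
     * has a query string and '?' otherwise.
     * Assumes the URL has no #fragment part.
     */
    static String withTimestamp(String url, long nowMillis) {
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "date=" + nowMillis;
    }
}
```

A caller would pass `System.currentTimeMillis()` as the second argument; taking the time as a parameter keeps the helper easy to test.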

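VII. Other adjustments
1. Page crawling rule
     Before: crawl the list first, then crawl the content. Now: fetch each article's content while crawling the list.
2. How data is saved
      Before: crawl the title and other summary fields, save them to the database, update the content later, and then delete articles from other sources as the business required (copyright concerns). Now: filter by source up front, build the complete record, and then save in bulk.
3. Handling unwanted content in a page
       Remove comments, JS code, and useless tag blocks.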
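For the comment and JS removal, the listings use an indexOf/substring loop. As an alternative, here is a minimal regex-based sketch. `HtmlCleaner` is a hypothetical name, and regular expressions are a blunt tool for HTML, so this only suits simple, well-formed article fragments like the ones handled here.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original project.
class HtmlCleaner {
    // DOTALL lets '.' match line breaks, so multi-line comments and scripts are covered.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern SCRIPT =
            Pattern.compile("<script\\b.*?</script\\s*>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Removes HTML comments first, then script blocks, and returns the rest unchanged. */
    static String strip(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return SCRIPT.matcher(withoutComments).replaceAll("");
    }
}
```

Unlike the loop, this also handles a comment or script that sits at the very start of the string, and it cannot loop forever on unbalanced markup because each pattern is applied once.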


1049 最后一块石头的重量 II 视频讲解动态规划之背包问题这个背包最多能装多少 LeetCode 1049 最后一块石头的重量II 哔哩哔哩 bilibili 代码随想录初步思路动态规划总结套用01背包 dp j max d
python 两个数值互换（一句代码搞定）

a sire b 23 a b b a print a print b
springboot本机启动elasticjob抛出异常HostException(ip is null)

1 使用的elasticjob版本为3 0 1 2 本机的IPV4在校验isReachable 返回false 可能是使用无线网导致ip验证问题 3 最后引入Groovy解决引入包
Oracle 视图中出现重复记录（left join）

Oracle 视图中出现重复记录问题解决办法注意问题今天做项目的时候客户反映页面中出现了重复的数据经排查后发现前短数据新增的字段来自于应该新的表当时是直接使用 left join 左连接的方式对数据进行拼接的 left j
（已解决）DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`.

DeprecationWarning np float is a deprecated alias for the builtin float To silence this warning use float by itself Doin
upload-labs-1

打开第一关通过查看源码我们可以发现第一关属于前端验证我们可以将浏览器JS代码禁用掉禁用JavaScript
[图形学] 《Real-Time Rendering》碰撞检测（二）

reference Real Time Rendering 目录 17 前言 17 1 和射线的碰撞检测 17 2 使用BSP树的动态碰撞检测 17 3 一般层次的碰撞检测 17 3 1 分层的构建 17 3 2 不同层之间的碰撞检测 17
验证码图片实现

使用验证码进行验证自动生成验证码后台实现 package common makeCertPic import java awt Color import java awt Font import java awt Graphics im
第15课：生活中的命令模式——大闸蟹，走起

用程序来模拟生活从剧情中思考命令模式命令模式命令模式的模型抽象代码框架类图模型说明实战应用应用场景故事剧情 David 听说阿里开了一家实体店盒马鲜生特别火爆明天就周末了我们一起去吃大闸蟹吧 Tony 吃货真是味
前端三剑客---HTML&CSS&JavaScript

HTML CSS JavaScript 1 HTML 1 1 介绍 1 2 快速入门 1 3 基础标签 1 3 1 标题标签 1 3 2 hr标签 1 3 3 字体标签 1 3 4 换行标签 1 3 5 段落标签 1 3 6 加粗斜体下
diagnose-tools 编译报错

在 Ubuntu 20 04 4 LTS 环境中编译diagnose tools 执行make deps时报错 checking whether gcc m32 makes executables we can run no config
ValueError: Buffer dtype mismatch, expected ‘unsigned char‘ but got ‘long‘

在使用pydensecrf进行densecrf时出现ValueError def dense crf img probs n labels 2 h probs shape 0 w probs shape 1 probs np expand
NBA GLOSSARY

NBA 全称 National Basketball Association 美国国家篮球协会 DRAFT draft dr ft n 选秀 R1 Round one 第一轮 St Vincent St Mary HS OH Saint V
electron-builder打包过程中报错——网络下载篇（转）

在electron使用electron builder打包过程中需要用到几个github上的包但是由于网络原因会科学上网的同学基本不用看了下载不下来导致出错一 electron v8 2 0 win32 x64 zip 如下图导
MCP4725介绍和STM32模拟IC2驱动

一 MCP4725 简单总结为下面几个特点 1路DAC输出 12位分辨率 I2C 接口标准快速高速支持供电电压2 7 5 5 内部EEPROM存储设置 I2C地址可配置 A0 A1 A2内置默认为 00 二硬件设计 MCP472
torch 测试GPU能否正常使用

运行程序 import torch print torch cuda is available num gpu 1 Decide which device we want to run on device torch device cuda
介绍D3DPOOL和Lock

介绍D3DPOOL和Lock 分类 DirectX 2013 02 28 00 21 322人阅读评论 0 收藏举报 D3D RUTIME的内存类型分为3种 VIDEO MEMORY VM AGP MEMORY AM 和SYSTEM
最能感动女人的十大瞬间

拉着手在街上闲逛忽然之间他将她拽停伸手轻轻地将眼睑下的一根睫毛拨开她顿感幸福拨走睫毛不过是弹指之间的小事却充分说明他对她的注意力100 集中要不是他喜欢仔细地偷看她怎能发现刚跌落的一根细小睫毛没有一个女人能够抵抗男人如此
Java爬虫框架WebMagic的使用总结

最近项目做一个公司新闻网站分为PC 移动端 h5 数据来源是从HSZX与huanqiu2个网站爬取主要使用Java编写的WebMagic作为爬虫框架数据分为批量抓取增量抓取批量抓当前所有历史数据增量需要每10分钟定时抓取一次

Java爬虫框架WebMagic的使用总结

Java爬虫框架WebMagic的使用总结 的相关文章

随机推荐

热门标签

Java爬虫框架WebMagic的使用总结的相关文章