一、PHPCrawler的介绍与安装
先了解一下什么是抓取?
抓取就是网络爬虫,也就是人们常说的网络蜘蛛(spider)。是搜索引擎的一个重要组成部分,按照一定的逻辑和算法抓取和下载互联网上的信息和网页。一般的爬虫从一个start url开始,按照一定的策略开始爬取,把爬取到的新的url放入爬取队列中,然后进行新一轮的爬取,直到抓取完毕为止。
PHPCrawler是一个国外开源的爬虫系统,它的源码托管在sourceforge里,可以在 SourceForge 上的 phpcrawl 项目页面下载
,根据自己电脑里安装的PHP版本选择合适的版本下载。下载完毕之后,解压到服务器网站根目录下,复制example.php文件,并重命名。
二、完整源码
<?php
// It may take a whils to crawl a site ...
set_time_limit(10000);
// Inculde the phpcrawl-mainclass
include("libs/PHPCrawler.class.php");
// Extend the class and override the handleDocumentInfo()-method
class MyCrawler extends PHPCrawler
{
//在这里解析页面内容
function handleDocumentInfo($DocInfo)
{
// Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
// Print the URL and the HTTP-status-Code
echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
// Print the refering URL
echo "Referer-page: ".$DocInfo->referer_url.$lb;
// Print if the content of the document was be recieved or not
if ($DocInfo->received == true)
echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
else
echo "Content not received".$lb;
// Now you should do something with the content of the actual
// received page or file ($DocInfo->source), we skip it in this example
//echo $DocInfo->source;
//echo $lb;
$url=$DocInfo->url;
$pat="/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/";
if(preg_match($pat,$url)>0){
$this->parseSonglistDetail($DocInfo);
}
flush();
}
public function parseSonglistDetail($DocInfo){
$songlistArr=array();
$songlistArr['raw_url']=$DocInfo->url;
$content=$DocInfo->content;
//名称
$matches=array();
$pat="/<span>名称:<\/span>([^(<br)]+)<br \/>/";
$res=preg_match($pat, $content,$matches);
if($res>0){
$songlistArr['title']=$matches[1];
}else{
$songlistArr['title']="";
print "error:get title fail<br/>";
}
//创建人
$matches=array();
$pat="/<span>创建人:<\/span>([^(<br)]+)<br \/>/";
$res=preg_match($pat, $content,$matches);
if($res>0){
$songlistArr['creator']=$matches[1];
}else{
$songlistArr['creator']="";
print "error:get creator fail<br/>";
}
//创建时间
$matches=array();
$pat="/<span>更新时间:<\/span>([^(<br)]+)<br \/>/";
$res=preg_match($pat, $content,$matches);
if($res>0){
$songlistArr['create_date']=$matches[1];
}else{
$songlistArr['create_date']="";
print "error:get create_date fail<br/>";
}
//简介
$matches=array();
$pat="/<span>简介:<\/span>([^(<\/p)]*)<\/p>/";
$res=preg_match($pat, $content,$matches);
if($res>0){
$songlistArr['info']=$matches[1];
}else{
$songlistArr['info']="";
print "error:get info fail<br/>";
}
//歌曲
$matches=array();
$pat="/<a title=\"([^\"]+)\" hidefocus=\"/";
$res=preg_match_all($pat, $content,$matches);
if($res>0){
$songlistArr['songs']=array();
for($i=0;$i<count($matches[1]);$i++){
$song_title=$matches[1][$i];
array_push($songlistArr['songs'],array('title'=>$song_title));
}
}else{
$songlistArr['song']="";
print "error:get song fail<br/>";
}
echo "<pre>";
print_r($songlistArr);
echo "</pre>";
$this->saveSonglist($songlistArr);
}
public function saveSonglist($songlistArr){
//连接数据库
$conn=mysql_connect("localhost","root","root");
mysql_select_db("songlist",$conn);
mysql_query("set names utf8");
$songlist=array();
$songlist['title']=mysql_escape_string($songlistArr['title']);
$songlist['create_time']=mysql_escape_string($songlistArr['create_date']);
$songlist['creator']=mysql_escape_string($songlistArr['creator']);
$songlist['raw_url']=mysql_escape_string($songlistArr['raw_url']);
$songlist['info']=mysql_escape_string($songlistArr['info']);
$sql="insert into songlist set".
"title=''".$songlist['title']."'".
",creat_time=''".$songlist['create_time']."'".
",creator=''".$songlist['creator']."'".
",raw_url=''".$songlist['raw_url']."'".
",info=''".$songlist['info']."';";
mysql_query($sql,$conn);
$songlist_id=mysql_insert_id();
foreach($songlistArr['songs'] as $song){
$title=mysql_escape_string($song['title']);
$sql="insert into song set title='".$title."'" .",songlist_id=".$songlist_id.";";
mysql_query($sql);
}
mysql_close($conn);
}
}
// Now, create a instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
//创建一个爬虫
$crawler = new MyCrawler();
//设置一个开始的连接
// URL to crawl
$start_url="www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);
//设置内容的类型
// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");
//忽略图片,设置那些连接不需要下载
//每一个精选集的连接
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/single/\d+\.html# i");//i 忽略大小写
//精选集页面的链接 下一页
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/index/\d+-0-2.html# i");
// Ignore links to pictures, dont even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);
// Set the traffic-limit to 1 MB (in bytes,
// for testing we dont want to "suck" the whole site)
//数据内容的容量,多少m,0是无限的
$crawler->setTrafficLimit(1000 * 1024);
// Thats enough, now here we go
$crawler->go();
// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();
if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
?>
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)