记录使用scrapy爬取新闻网站最新新闻存入MySQL数据库，每天定时爬取自动更新

news/2024/5/9 2:56:18/文章来源:https://blog.csdn.net/weixin_43857152/article/details/86071216

爬取每天更新的新闻，使用scrapy框架，Python2.7，存入MySQL数据库，将每次的爬虫日志和爬取过程中的bug信息存为log文件下。定义bat批处理文件，添加到计划任务程序中，自动爬取。
在这里插入图片描述
额…
1.在items文件中，定义需要爬取的类

2.在settings文件中设置默认项，设置日志输出格式，打开pipeline文件，设置delay时间，设置数据库信息，设置请求头等信息
3.编写自己的spider文件

class TouchuangSpider(scrapy.Spider):name = 'touchuang'allowed_domains = ['xunjk.com']url = {"1": "http://www.xunjk.com/xinwen/rongzi/",     # 融资"2": "http://www.xunjk.com/shangye/",           # 商业"3": "http://www.xunjk.com/xinwen/yanjiu/",      # 研究"4": "http://www.xunjk.com/xinwen/keji/",       # 科技"5": "http://www.xunjk.com/xinwen/jinrong/",    # 金融"6": "http://www.xunjk.com/xinwen/dongcha/",    # 洞察"7": "http://www.xunjk.com/xinwen/yejie/"       # 业界}start_urls = [url["1"], url["2"], url["3"], url["4"], url["5"], url["6"], url["7"]]# start_urls = [url["1"]]

因为同时爬取几个板块的新闻，将板块编号设置为字典k值，链接设置为v值。
访问url，回调prase()函数，进一步处理。
提取中用到的常用的xpath提取，这个没什么可说的

 def request_page(self,response):date = time.strftime("%Y%m%d")try:item = XinwenItem()item["title"] = response.xpath("//div[@class='main_c']/h1/text()").extract_first()      # 获取新闻标题item["zuozhe"] = response.xpath("//div[@class='infos']/span[@class='from']/a/text()").extract_first()      # 获取新闻来源page_url = response.xpath("//div[@class='breadnav']/a[3]/@href").extract_first()for k, v in self.url.items():       # 为获取新闻分类id，获取到当前页分类url作为字典v值，取得k值if v == page_url:item["fenlei_id"] = k   # k为文章分类id# 判断文章中是否有图片，有获取图片；无返回空item["created_at"] = response.xpath("//div[@class='infos']/span[@class='time']/text()").extract_first()except Exception as e:# 若报错，将错误打印txt返回with open(r"D:\pycharm_projects\xinwenyuan\xinwen\log\/"+date+".txt", "a+") as f:f.write("e:"+e+"\n")try:img = re.search('<img .* src="(.*?)" width="(.*?)" /></p>', response.text).group(1)item["news_pic"] = imgprint item["news_pic"]      except:item["news_pic"] = ""   # 无图片返回空

其中正文部分，刚开始用的是xpath提取文本信息，但考虑到有部分新闻都有图片信息，并且后期还要将图片和文本一一对应，所以改用正则匹配p标签获取正文，最后返回items
4.存入mysql数据库
在pipelines文件中编写sql信息

class XinwenPipeline(object):def __init__(self):print "connect successful..."# 链接MySQL数据库self.connect = pymysql.connect(host=settings.MYSQL_HOST,user=settings.MYSQL_USER,password=settings.MYSQL_PASSWD,db=settings.MYSQL_DBNAME,port=settings.MYSQL_PORT,charset="utf8")# 获取游标self.cursor = self.connect.cursor()# 存入数据库def process_item(self, item, spider):date = time.strftime("%Y%m%d")print "doing something..."try:# 执行sql语句插入，其中title设置为唯一字段，防止重复录入sql = '''insert into articles(title,zuozhe,content,fenlei_id,created_at,news_pic,updated_at,dianji) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)'''self.cursor.execute(sql, (item["title"],item["zuozhe"],item["content"],item["fenlei_id"],item["created_at"],item["news_pic"], item["updated_at"], item["dianji"]))self.connect.commit()      # 保存except Exception as error:# 出现错误时打印错误日志with open(r"D:\pycharm_projects\xinwenyuan\xinwen\log\/"+date+".txt", "a+") as f:f.write(item["created_at"]+"error:"+error[1]+"\n")return item# 关闭数据库def close_spider(self, spider):print "working done..."self.cursor.close()self.connect.close()

5.在start.py中设置定时任务，实现每天定时爬取
在这里插入图片描述
最后是设置bat文件

定时任务这一部分参考了网上的文档，原链接如下：
https://blog.csdn.net/zwq912318834/article/details/77806737
将设置好的bat文件加入到计划执行任务中

其中，本次爬虫任务采用的scrapy默认的线程数，没有设置其他的多线程，没有使用代理ip，所以只设置了delay时间，这样爬虫随时会因为ip被封挂掉的，不过到现在还好，一切ok。
代码中还有很多其他的问题，小伙伴们可以留言交流[认真滑稽脸].jpg