Python Scrapy in Practice: Scraping Jianshu and Saving the Data to MySQL

2024/5/8 16:21:47 / Article source: https://blog.csdn.net/weixin_30566111/article/details/99662105

 

1: Create the project

2: Create the spider
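Steps 1 and 2 are normally done with the Scrapy command-line tool. The sketch below only assembles the two commands as argument lists. The project name `jianshu_spider` is inferred from the import in js.py, and the spider name `js` and domain `jianshu.com` come from the spider itself. The `-t crawl` template is an assumption based on the spider subclassing `CrawlSpider`, and `run_setup` is a hypothetical helper that is not part of the original post.

```python
import shlex
import subprocess

# Project name inferred from "from jianshu_spider.items import ArticleItem".
CREATE_PROJECT = "scrapy startproject jianshu_spider"
# "-t crawl" selects the CrawlSpider template (assumed, since JsSpider subclasses CrawlSpider).
CREATE_SPIDER = "scrapy genspider -t crawl js jianshu.com"


def as_argv(command):
    """Split a command line into the argument list that subprocess expects."""
    return shlex.split(command)


def run_setup():
    """Hypothetical helper: run both commands (requires Scrapy to be installed)."""
    subprocess.run(as_argv(CREATE_PROJECT), check=True)
    # genspider has to run inside the newly created project directory.
    subprocess.run(as_argv(CREATE_SPIDER), check=True, cwd="jianshu_spider")
```

Typing the two commands directly in a terminal has the same effect; the helper is only a convenience.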

3: Write a start.py file to run the spider

# -*- coding:utf-8 -*-
# Author:  baikai
# Created: 2018/12/14 14:09
# File:    start.py
# IDE:     PyCharm
from scrapy import cmdline

# Same as running "scrapy crawl js" from the project root
cmdline.execute("scrapy crawl js".split())

4: Configure the relevant settings in settings.py
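The post does not list which settings were changed, so the fragment below is only a guess at a typical setup. The `ITEM_PIPELINES` entry is what the pipeline later in this article needs in order to run; its module path assumes the class lives in the default `pipelines.py` of a project named `jianshu_spider`. The other values (`ROBOTSTXT_OBEY`, `DOWNLOAD_DELAY`, the request headers) are assumptions, not values taken from the original.

```python
# settings.py (fragment) -- assumed values, not taken from the original post

BOT_NAME = "jianshu_spider"

# Assumption: many tutorials turn robots.txt checking off.
# Check the site's terms of use before doing the same.
ROBOTSTXT_OBEY = False

# Assumption: wait one second between requests to keep the load low.
DOWNLOAD_DELAY = 1

# Assumption: a browser-like User-Agent; the exact string is a placeholder.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    ),
}

# Enable the MySQL pipeline; lower numbers run earlier (valid range 0-1000).
ITEM_PIPELINES = {
    "jianshu_spider.pipelines.JianshuSpiderPipeline": 300,
}
```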

Scraping the detail-page data

Write items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticleItem(scrapy.Item):
    # Fields we want to store for each article
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()

Write js.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from jianshu_spider.items import ArticleItem


class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    rules = (
        # Matches article URLs such as https://www.jianshu.com/p/d8804d18d638
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        # Fetch and parse the detail-page data
        title = response.xpath("//h1[@class='title']/text()").get()
        # Author avatar
        avatar = response.xpath("//a[@class='avatar']/img/@src").get()
        author = response.xpath("//span[@class='name']/a/text()").get()
        # Publication time
        pub_time = response.xpath("//span[@class='publish-time']/text()").get()
        # Detail-page id
        url = response.url
        # e.g. https://www.jianshu.com/p/d8804d18d638
        url1 = url.split("?")[0]
        article_id = url1.split("/")[-1]
        # Article content
        content = response.xpath("//div[@class='show-content']").get()
        item = ArticleItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=response.url,
            article_id=article_id,
            content=content,
        )
        yield item
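The link pattern and the article-id extraction in the spider can be tried on their own, without running a crawl. The sketch below reuses the regular expression and the two `split` calls from js.py. The helper names `is_article_url` and `extract_article_id`, and the query string in the usage example, are hypothetical.

```python
import re

# Same pattern as the Rule in js.py: "/p/" followed by 12 lowercase letters or digits.
ARTICLE_URL_PATTERN = re.compile(r'.*/p/[0-9a-z]{12}.*')


def is_article_url(url):
    """Return True if the URL looks like a Jianshu article page."""
    return ARTICLE_URL_PATTERN.search(url) is not None


def extract_article_id(url):
    """Drop the query string, then take the last path segment as the id."""
    return url.split("?")[0].split("/")[-1]


print(is_article_url("https://www.jianshu.com/p/d8804d18d638"))
print(extract_article_id("https://www.jianshu.com/p/d8804d18d638?utm_source=desktop"))
```

Note that a URL ending in a trailing slash would give an empty id with this approach, so stripping the slash first may be worth adding.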

 

Designing the database and table

Database: jianshu

Table: article

The id column is set to auto-increment.
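The post only names the database, the table, and the auto-increment id, so the schema below is a sketch. The column names follow the item fields and the insert statement used in the pipeline; the column types and lengths are assumptions. `create_schema` is a hypothetical helper that expects an open connection, for example one returned by `pymysql.connect`.

```python
# Assumed DDL: column names come from the article's insert statement,
# types and lengths are guesses and may need adjusting.
CREATE_DATABASE = "CREATE DATABASE IF NOT EXISTS jianshu DEFAULT CHARACTER SET utf8"

CREATE_ARTICLE_TABLE = """
CREATE TABLE IF NOT EXISTS article (
    id INT NOT NULL AUTO_INCREMENT,
    title VARCHAR(255),
    content LONGTEXT,
    author VARCHAR(255),
    avatar VARCHAR(255),
    pub_time VARCHAR(64),
    origin_url VARCHAR(255),
    article_id VARCHAR(32),
    PRIMARY KEY (id)
) DEFAULT CHARSET=utf8
"""


def create_schema(connection):
    """Hypothetical helper: create the database and table on an open connection."""
    with connection.cursor() as cursor:
        cursor.execute(CREATE_DATABASE)
        cursor.execute("USE jianshu")
        cursor.execute(CREATE_ARTICLE_TABLE)
    connection.commit()
```

`pub_time` is kept as text here because the spider stores the raw string scraped from the page; a `DATETIME` column would need the value parsed first.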

 

Saving the scraped data to the MySQL database

 

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
# adbapi and cursors are imported but not used by this synchronous pipeline
from twisted.enterprise import adbapi
from pymysql import cursors


class JianshuSpiderPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '8Wxx.ypa',
            'database': 'jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        # Insert one row per item and commit immediately
        self.cursor.execute(self.sql, (item['title'], item['content'], item['author'], item['avatar'],
                                       item['pub_time'], item['origin_url'], item['article_id']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        # Build the insert statement once and cache it
        if not self._sql:
            self._sql = """insert into article(id,title,content,author,avatar,pub_time,origin_url,article_id) values(null,%s,%s,%s,%s,%s,%s,%s)"""
            return self._sql
        return self._sql
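The pipeline above imports `adbapi` from Twisted without using it, which hints at an asynchronous variant; the text here does not include one. The sketch below shows what such a variant could look like with `adbapi.ConnectionPool`, so that inserts run in a thread pool instead of blocking the crawl. The class name `JianshuTwistedPipeline`, the helper `item_to_params`, and the `JIANSHU_DB_PASSWORD` environment variable are all hypothetical, and the class needs Twisted and pymysql installed to be instantiated.

```python
import os

INSERT_SQL = (
    "insert into article(id,title,content,author,avatar,pub_time,origin_url,article_id) "
    "values(null,%s,%s,%s,%s,%s,%s,%s)"
)

# Order matches the placeholders in INSERT_SQL.
FIELDS = ("title", "content", "author", "avatar", "pub_time", "origin_url", "article_id")


def item_to_params(item):
    """Turn an item (or any mapping) into the parameter tuple for INSERT_SQL."""
    return tuple(item[name] for name in FIELDS)


class JianshuTwistedPipeline(object):
    def __init__(self):
        # Imported here so the helpers above can be used without Twisted installed.
        from twisted.enterprise import adbapi

        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            # Hypothetical environment variable, used instead of a hard-coded password.
            'password': os.environ.get('JIANSHU_DB_PASSWORD', ''),
            'database': 'jianshu',
            'charset': 'utf8',
        }
        # The first argument names the DB-API module that the pool imports.
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)

    def process_item(self, item, spider):
        # runInteraction runs insert_item in a worker thread inside a
        # transaction and commits when the function returns without error.
        deferred = self.dbpool.runInteraction(self.insert_item, item)
        deferred.addErrback(self.handle_error, item, spider)
        return item

    def insert_item(self, cursor, item):
        cursor.execute(INSERT_SQL, item_to_params(item))

    def handle_error(self, failure, item, spider):
        spider.logger.error("Insert failed for %s: %s", item['origin_url'], failure)

    def close_spider(self, spider):
        self.dbpool.close()
```

To try it, the class would have to be registered in `ITEM_PIPELINES` in place of (or alongside) `JianshuSpiderPipeline`.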

Running start.py gives the following result

 

Reposted from: https://www.cnblogs.com/bkwxx/p/10120540.html
