24. Monitoring a Website for Data Updates - 1

I. Using Scrapy to monitor whether a website has been updated

1. spider.py
# -*- coding: utf-8 -*-
import scrapy
import time
import re
from WEB.conmon.md5_tool import md5_encode
from WEB.items import WebItem


class CompanyInfoSpider(scrapy.Spider):
    name = 'wenzhou'
    allowed_domains = ['wzszjw.wenzhou.gov.cn']
    start_urls = ['http://wzszjw.wenzhou.gov.cn/col/col1357901/index.html']
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "ITEM_PIPELINES": {
            'WEB.pipelines.MysqlPipeline': 320,
        },
        "DOWNLOADER_MIDDLEWARES": {
            'WEB.middlewares.RandomUaseragentMiddleware': 500,
        },
    }

    def parse(self, response):
        # re-encode and decode to normalize the page text
        _response = response.text.encode('utf-8').decode('utf-8')
        # every news entry is rendered as <span>date</span><b>&middot;</b><a href='...';
        # capture them all so any added or changed entry alters the fingerprint
        texts = re.findall("<span>.*?</span><b>&middot;</b><a href=\'.*?\'", _response)
        content = ""
        for text in texts:
            content += text
        text_md5 = md5_encode(content)
        item = WebItem()
        item["website_name"] = "温州市建设工程造价管理处"
        item["website_url"] = response.url
        item["content_md5"] = text_md5
        item["date_time"] = time.time()
        print(item)
        yield item
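To see what the regular expression actually captures, here is a minimal sketch run against made-up markup in the page's list format (the sample HTML below is illustrative, not taken from the live page):

# -*- coding: utf-8 -*-
import re

sample = ("<span>2018-09-14</span><b>&middot;</b><a href='/art/1.html'>"
          "<span>2018-09-13</span><b>&middot;</b><a href='/art/2.html'>")
texts = re.findall("<span>.*?</span><b>&middot;</b><a href=\'.*?\'", sample)
print(texts)
# ["<span>2018-09-14</span><b>&middot;</b><a href='/art/1.html'",
#  "<span>2018-09-13</span><b>&middot;</b><a href='/art/2.html'"]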

2. md5_tool.py, which the spider uses to hash the extracted tag content so that each stored record is unique (this MD5 value is the field later compared when monitoring the site)

# -*- coding:utf-8 -*-
import hashlib


# hash a string with MD5
def md5_encode(text):
    hash = hashlib.md5()
    hash.update(bytes(text, encoding='utf-8'))  # the string to fingerprint goes here
    return hash.hexdigest()  # the hex digest of the hash
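A quick sanity check of md5_encode (a minimal sketch; the sample string is made up, but identical input always yields the identical 32-character hex digest, which is exactly why it works as a change fingerprint):

# -*- coding:utf-8 -*-
from WEB.conmon.md5_tool import md5_encode

# the sample markup is illustrative, not taken from the live page
a = md5_encode("<span>2018-09-14</span><b>&middot;</b><a href='/art/1.html'")
b = md5_encode("<span>2018-09-14</span><b>&middot;</b><a href='/art/1.html'")
assert a == b          # same input, same fingerprint
assert len(a) == 32    # the hex digest is always 32 characters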
3. A generic pipelines.py that connects to the database
# -*- coding: utf-8 -*-
from scrapy.conf import settings
import pymysql


class WebPipeline(object):
    def process_item(self, item, spider):
        return item


# save the item to MySQL
class MysqlPipeline(object):
    def open_spider(self, spider):
        self.host = settings.get('MYSQL_HOST')
        self.port = settings.get('MYSQL_PORT')
        self.user = settings.get('MYSQL_USER')
        self.password = settings.get('MYSQL_PASSWORD')
        self.db = settings.get('MYSQL_DB')
        self.table = settings.get('TABLE')
        self.client = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                      port=self.port, db=self.db, charset='utf8')

    def process_item(self, item, spider):
        item_dict = dict(item)
        cursor = self.client.cursor()
        values = ','.join(['%s'] * len(item_dict))
        keys = ','.join(item_dict.keys())
        sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=self.table, keys=keys, values=values)
        try:
            # first argument is the SQL statement, second a tuple of values
            if cursor.execute(sql, tuple(item_dict.values())):
                print('成功')
                self.client.commit()
        except Exception as e:
            print(e)
            print('失败')
            self.client.rollback()
        return item

    def close_spider(self, spider):
        self.client.close()
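One caveat: the bare except above treats every MySQL error as "site unchanged". To distinguish a duplicate fingerprint from a genuine database failure, a tighter variant is possible — a sketch assuming pymysql as the driver; insert_fingerprint is a hypothetical helper, not part of the project above:

import pymysql

def insert_fingerprint(client, sql, values):
    # True  -> the row was new, i.e. the site changed
    # False -> duplicate content_md5, i.e. the site is unchanged
    cursor = client.cursor()
    try:
        cursor.execute(sql, values)
        client.commit()
        return True
    except pymysql.err.IntegrityError as e:
        client.rollback()
        if e.args[0] == 1062:  # MySQL duplicate-key error code
            return False
        raise  # any other integrity error is a real failure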

4. settings.py configuration

# -*- coding: utf-8 -*-

# Scrapy settings for WEB project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'WEB'

SPIDER_MODULES = ['WEB.spiders']
NEWSPIDER_MODULE = 'WEB.spiders'

# MySQL configuration
MYSQL_HOST = "172.16.0.55"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = "concom603"
MYSQL_DB = 'web_page'
TABLE = "web_page_update"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'WEB (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'WEB.middlewares.WebSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'WEB.middlewares.WebDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'WEB.pipelines.WebPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5. items.py field definitions

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WebItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content_md5 = scrapy.Field()   # fingerprint of the monitored content
    website_url = scrapy.Field()   # URL of the crawled page
    website_name = scrapy.Field()  # website name
    date_time = scrapy.Field()     # timestamp of the crawl

6. Creating the database table

CREATE TABLE `web_page_update` (
  `id` int(22) NOT NULL AUTO_INCREMENT,
  `website_url` varchar(255) DEFAULT NULL COMMENT 'site URL',
  `website_name` varchar(255) DEFAULT NULL COMMENT 'name of the crawled site',
  `content_md5` varchar(255) DEFAULT NULL COMMENT 'MD5 of the page content',
  `date_time` decimal(65,7) DEFAULT NULL COMMENT 'timestamp',
  PRIMARY KEY (`id`),
  UNIQUE KEY `content_md5` (`content_md5`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=230 DEFAULT CHARSET=utf8;
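As rows accumulate, the table doubles as an update history: the newest date_time per site is the last detected change. A minimal sketch of reading it back (connection parameters copied from the settings.py values above):

# -*- coding: utf-8 -*-
import pymysql

client = pymysql.connect(host="172.16.0.55", user="root", password="concom603",
                         port=3306, db="web_page", charset="utf8")
cursor = client.cursor()
# FROM_UNIXTIME turns the stored float timestamp back into a datetime
cursor.execute("SELECT website_name, MAX(FROM_UNIXTIME(date_time)) "
               "FROM web_page_update GROUP BY website_name")
for website_name, last_change in cursor.fetchall():
    print(website_name, last_change)
client.close()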

7. Running the spider

scrapy crawl wenzhou

E:\Spider\work_code\9-15\WEB(1)\WEB>scrapy crawl wenzhou
2018-09-17 10:48:14 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: WEB)
2018-09-17 10:48:14 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.3, Platform Windows-7-6.1.7601-SP1
2018-09-17 10:48:14 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'WEB', 'SPIDER_MODULES': ['WEB.spiders'], 'DOWNLOAD_DELAY': 0.5, 'NEWSPIDER_MODULE': 'WEB.spiders'}
2018-09-17 10:48:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-09-17 10:48:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'WEB.middlewares.RandomUaseragentMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-17 10:48:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-17 10:48:14 [scrapy.middleware] INFO: Enabled item pipelines:
['WEB.pipelines.MysqlPipeline']
2018-09-17 10:48:14 [scrapy.core.engine] INFO: Spider opened
2018-09-17 10:48:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-17 10:48:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-17 10:48:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://wzszjw.wenzhou.gov.cn/col/col1357901/index.html> (referer: None)
{'content_md5': 'd0ecfb7ce1f4871e4cd091fc53982755',
 'date_time': 1537152494.7628245,
 'website_name': '温州市建设工程造价管理处',
 'website_url': 'http://wzszjw.wenzhou.gov.cn/col/col1357901/index.html'}
(1062, "Duplicate entry 'd0ecfb7ce1f4871e4cd091fc53982755' for key 'content_md5'")
失败
2018-09-17 10:48:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://wzszjw.wenzhou.gov.cn/col/col1357901/index.html>
{'content_md5': 'd0ecfb7ce1f4871e4cd091fc53982755',
 'date_time': 1537152494.7628245,
 'website_name': '温州市建设工程造价管理处',
 'website_url': 'http://wzszjw.wenzhou.gov.cn/col/col1357901/index.html'}
2018-09-17 10:48:14 [scrapy.core.engine] INFO: Closing spider (finished)
2018-09-17 10:48:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 296,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2855,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 9, 17, 2, 48, 14, 767824),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 9, 17, 2, 48, 14, 492809)}
2018-09-17 10:48:14 [scrapy.core.engine] INFO: Spider closed (finished)

E:\Spider\work_code\9-15\WEB(1)\WEB>

Because I had already inserted this data into the database during earlier testing, the identical fingerprint was already stored, so the insert reported 失败 (failure). The pipeline only prints 成功 (success) when the site has updated and new data appears.
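Since a successful insert is itself the update signal, the crawl can be wrapped in a small script that fires an alert only on success. A sketch under that assumption (the wrapper and its alert hook are hypothetical, not part of the project above; it simply looks for the pipeline's 成功 output on stdout):

# -*- coding: utf-8 -*-
import subprocess

def site_updated():
    # run the spider; the pipeline prints 成功 only when a new
    # fingerprint row was inserted, i.e. the monitored list changed
    result = subprocess.run(["scrapy", "crawl", "wenzhou"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    return "成功" in result.stdout

if __name__ == "__main__":
    if site_updated():
        print("Site updated")  # hook an email / IM notification here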

