Extracting project codes from a bidding website with XPath

Source: https://blog.csdn.net/weixin_42961082/article/details/108935735 (reposted 2024-05-09 19:27:40)

First, set up the spider and confirm that the configured URL endpoint responds correctly. The next step is to extract the data with XPath (which fields to extract depends on your own needs).
Have a look at the page to be crawled (screenshot omitted), then at the spider code:

import scrapy
import re


class BilianSpider(scrapy.Spider):
    name = 'bilian'
    allowed_domains = ['ebnew.com']
    # start_urls = ['http://ebnew.com/']

    # Schema of the data to be stored
    sql_data = dict(
        projectcode='',        # project code
        web='',                # source website
        keyword='',            # search keyword
        detail_url='',         # URL of the bid detail page
        title='',              # title as published on the third-party site
        toptype='',            # information type
        province='',           # province
        Product='',            # product category
        industry='',           # industry
        tendering_manner='',   # tendering manner
        publicity_date='',     # publicity date
        expiry_date='',        # bid deadline
    )

    # Fields submitted in the search form
    form_data = dict(
        infoClassCodes='',
        rangeType='',
        projectType='bid',
        fundSourceCodes='',
        dateType='',
        startDateCode='',
        endDateCode='',
        normIndustry='',
        normIndustryName='',
        zone='',
        zoneName='',
        zoneText='',
        key='',                # search keyword
        pubDateType='',
        pubDateBegin='',
        pubDateEnd='',
        sortMethod='timeDesc',
        orgName='',
        currentPage=''         # current page number
    )

    def start_requests(self):
        from_data = self.form_data
        from_data['key'] = '路由器'
        from_data['currentPage'] = '2'
        yield scrapy.FormRequest(
            url='https://ss.ebnew.com/tradingSearch/index.htm',
            formdata=from_data,
            callback=self.parse_page1,
        )

    # def parse_page1(self, response):
    #     with open('2.html', 'wb') as f:
    #         f.write(response.body)

    # //div[@class="abstract-box mg-t25 ebnew-border-bottom mg-r15"]/div/i[1]/text()
    def parse_page1(self, response):
        content_list_x_s = response.xpath('//div[@class="ebnew-content-list"]/div')
        for content_list_x in content_list_x_s:
            toptype = content_list_x.xpath('./div/i[1]/text()').extract_first()
            publicity_date = content_list_x.xpath('./div/i[2]/text()').extract_first()
            title = content_list_x.xpath('./div/a/text()').extract_first()
            tendering_manner = content_list_x.xpath('.//div[2]/div[1]/p[1]/span[2]/text()').extract_first()
            Product = content_list_x.xpath('.//div[2]/div[1]/p[2]/span[2]/text()').extract_first()
            expiry_date = content_list_x.xpath('.//div[2]/div[2]/p[1]/span[2]/text()').extract_first()
            # note: 'spanp[2]' is as written in the original source; it looks like a
            # typo for 'span[2]', which would explain why province is always None below
            province = content_list_x.xpath('.//div[2]/div[2]/p[2]/spanp[2]/text()').extract_first()
            print(toptype, publicity_date, title, tendering_manner, Product, expiry_date, province)
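FormRequest serializes the form_data dict into a URL-encoded POST body. A minimal stdlib sketch of what actually goes over the wire (only the fields that start_requests fills in; the real request also sends every empty key):

```python
from urllib.parse import urlencode

# Non-empty fields from form_data; non-ASCII values are percent-encoded as UTF-8.
body = urlencode({'key': '路由器', 'currentPage': '2', 'sortMethod': 'timeDesc'})
print(body)
# key=%E8%B7%AF%E7%94%B1%E5%99%A8&currentPage=2&sortMethod=timeDesc
```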

Create a start.py file to run the spider:

from scrapy import cmdline

cmdline.execute('scrapy crawl bilian'.split())

The run output:

D:\Python3.8.5\python.exe D:/zhaobiao/zhaobiao/spiders/start.py
2020-10-06 06:31:34 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: zhaobiao)
2020-10-06 06:31:34 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-06 06:31:34 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-06 06:31:34 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'zhaobiao','DOWNLOAD_DELAY': 3,'NEWSPIDER_MODULE': 'zhaobiao.spiders','SPIDER_MODULES': ['zhaobiao.spiders']}
2020-10-06 06:31:34 [scrapy.extensions.telnet] INFO: Telnet Password: ec14f54ed056d314
2020-10-06 06:31:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.logstats.LogStats']
2020-10-06 06:31:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','zhaobiao.middlewares.ZhaobiaoDownloaderMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-06 06:31:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-06 06:31:35 [scrapy.middleware] INFO: Enabled item pipelines:
['zhaobiao.pipelines.ZhaobiaoPipeline']
2020-10-06 06:31:35 [scrapy.core.engine] INFO: Spider opened
2020-10-06 06:31:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-06 06:31:35 [bilian] INFO: Spider opened: bilian
2020-10-06 06:31:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-06 06:31:35 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ss.ebnew.com/tradingSearch/index.htm> (referer: None)
结果 发布日期:2020-09-30 合肥市包河区同安街道周谷堆社区居民委员会HF20200930144955483001直接采购... 询价采购  路由器,分流器 None None
公示 发布日期:2020-09-30 中国邮政集团有限公司云南省分公司4G无线 国内公开    2020-10-05 09:00:00 None
公示 发布日期:2020-09-30 未来网络试验设施国家重大科技基础设施项目深圳信息通信研究院承建系统—信... 国际公开  信号分析仪 2020-10-12 23:59:00 None
结果 发布日期:2020-09-30 内蒙古超高压供电局2020年第十一批集中采购(2020年三季度生产性消耗性材料... 询价采购  网络路由器 None None
公告 发布日期:2020-09-30 东莞市消防救援支队电子政务外网和指挥网链路租赁项目公开招标公告 国内公开  路由器 None None
公告 发布日期:2020-09-30 河南省滑县政务服务和大数据管理局滑县大数据中心建设项目-公开招标公告 国内公开  系统集成 None None
公告 发布日期:2020-09-30 内蒙古自治区团校内蒙古自治区团教务处标准化数字考场系统竞争性磋商 国内公开  路由器 None None
公告 发布日期:2020-09-30 西北工业大学翱翔体育馆网络升级改造采购项目招标公告 国内公开  负载均衡设备,检测设备 None None
结果 发布日期:2020-09-30 国家税务总局石家庄市税务局稽查局“四室一包”建设(硬件采购及集成)中标... 单一来源采购  音视频播放设备,光端机,扩音设备,LED大屏,... None None
公告 发布日期:2020-09-30 中国信息通信研究院工业互联网标识解析国家顶级节点(一期)建设——国拨追... 国内公开  系统集成 None None
2020-10-06 06:31:35 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-06 06:31:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 783,'downloader/request_count': 1,'downloader/request_method_count/POST': 1,'downloader/response_bytes': 14229,'downloader/response_count': 1,'downloader/response_status_count/200': 1,'elapsed_time_seconds': 0.333967,'finish_reason': 'finished','finish_time': datetime.datetime(2020, 10, 5, 22, 31, 35, 613753),'log_count/DEBUG': 1,'log_count/INFO': 11,'response_received_count': 1,'scheduler/dequeued': 1,'scheduler/dequeued/memory': 1,'scheduler/enqueued': 1,'scheduler/enqueued/memory': 1,'start_time': datetime.datetime(2020, 10, 5, 22, 31, 35, 279786)}
2020-10-06 06:31:35 [scrapy.core.engine] INFO: Spider closed (finished)Process finished with exit code 0

The text in the middle is the extracted bid information. None means there was nothing to extract at that position, so extract_first() returned None.
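The behaviour is the usual "first match or None" lookup. A stdlib analogy makes the point (this is xml.etree, not Scrapy; ElementTree only supports a subset of XPath):

```python
import xml.etree.ElementTree as ET

# A toy fragment shaped like one list entry on the search page
row = ET.fromstring('<div><i>公示</i><i>发布日期:2020-09-30</i></div>')

# A matching path yields the node...
assert row.find('./i[1]').text == '公示'

# ...but a path that matches nothing returns None, just like extract_first()
assert row.find('./span[2]') is None
```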

Next, extract information from the second-level (detail) page. After inspecting the page content (screenshot omitted), continue with the code:

    # start_requests and parse_page1 are unchanged from above (except that the
    # print at the end of parse_page1 is now commented out)

    def parse_page2(self, response):
        content_list_x_s1 = response.xpath('//ul[contains(@class,"ebnew-project-information")]/li')
        projectcode = content_list_x_s1[0].xpath('./span[2]/text()').extract_first()
        industry = content_list_x_s1[7].xpath('./span[2]/text()').extract_first()
        print(projectcode, industry)

Run again:

D:\Python3.8.5\python.exe D:/zhaobiao/zhaobiao/spiders/start.py
2020-10-06 06:41:27 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: zhaobiao)
2020-10-06 06:41:27 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-06 06:41:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-06 06:41:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'zhaobiao','DOWNLOAD_DELAY': 3,'NEWSPIDER_MODULE': 'zhaobiao.spiders','SPIDER_MODULES': ['zhaobiao.spiders']}
2020-10-06 06:41:27 [scrapy.extensions.telnet] INFO: Telnet Password: e3462c71dbccaed2
2020-10-06 06:41:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.logstats.LogStats']
2020-10-06 06:41:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','zhaobiao.middlewares.ZhaobiaoDownloaderMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-06 06:41:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-06 06:41:28 [scrapy.middleware] INFO: Enabled item pipelines:
['zhaobiao.pipelines.ZhaobiaoPipeline']
2020-10-06 06:41:28 [scrapy.core.engine] INFO: Spider opened
2020-10-06 06:41:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-06 06:41:28 [bilian] INFO: Spider opened: bilian
2020-10-06 06:41:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-06 06:41:28 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ss.ebnew.com/tradingSearch/index.htm> (referer: None)
2020-10-06 06:41:28 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-06 06:41:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 783,'downloader/request_count': 1,'downloader/request_method_count/POST': 1,'downloader/response_bytes': 14229,'downloader/response_count': 1,'downloader/response_status_count/200': 1,'elapsed_time_seconds': 0.344614,'finish_reason': 'finished','finish_time': datetime.datetime(2020, 10, 5, 22, 41, 28, 813691),'log_count/DEBUG': 1,'log_count/INFO': 11,'response_received_count': 1,'scheduler/dequeued': 1,'scheduler/dequeued/memory': 1,'scheduler/enqueued': 1,'scheduler/enqueued/memory': 1,'start_time': datetime.datetime(2020, 10, 5, 22, 41, 28, 469077)}
2020-10-06 06:41:28 [scrapy.core.engine] INFO: Spider closed (finished)Process finished with exit code 0

The run extracted no information at all — a failed extraction should at least return an empty list.
Time to analyze and narrow down the problem.
First, check the XPath syntax:

content_list_x_s1 = response.xpath('//ul[contains(@class,"ebnew-project-information")]/li')
projectcode = content_list_x_s1[0].xpath('./span[2]/text()').extract_first()
industry = content_list_x_s1[7].xpath('./span[2]/text()').extract_first()

I checked the syntax repeatedly, and repeated tests in the browser did return projectcode and industry.
Suspecting XPath itself, I switched to a regular expression:
import the re module and write the pattern,

    def parse_page2(self, response):
        content_list_x_s1 = response.xpath('//ul[contains(@class,"ebnew-project-information")]/li')
        projectcode = content_list_x_s1[0].xpath('./span[2]/text()').extract_first()
        industry = content_list_x_s1[7].xpath('./span[2]/text()').extract_first()
        if not projectcode:
            projectcode_find = re.findall(
                r'项目编号[::]{0,1}\s{0,2}([a-zA-Z0-9]{10,80})',
                response.body.decode('utf-8'))
            if projectcode_find:
                projectcode = projectcode_find[0]
            else:
                print('-' * 90)
        print(projectcode, industry)
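The pattern can be exercised in isolation against a sample of detail-page text (the sample string below is made up for illustration):

```python
import re

# Optional full- or half-width colon, up to two whitespace characters,
# then a 10-80 character alphanumeric code captured as group 1.
pattern = r'项目编号[::]{0,1}\s{0,2}([a-zA-Z0-9]{10,80})'

sample = '项目编号:HF20200930144955483001 采购人:某社区居委会'
print(re.findall(pattern, sample))            # ['HF20200930144955483001']
print(re.findall(pattern, '页面上没有编号'))   # [] -- a failed match is an empty list, not silence
```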

Run again and look at the result:

D:\Python3.8.5\python.exe D:/zhaobiao/zhaobiao/spiders/start.py
2020-10-06 06:48:50 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: zhaobiao)
2020-10-06 06:48:50 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-06 06:48:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-06 06:48:50 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'zhaobiao','DOWNLOAD_DELAY': 3,'NEWSPIDER_MODULE': 'zhaobiao.spiders','SPIDER_MODULES': ['zhaobiao.spiders']}
2020-10-06 06:48:50 [scrapy.extensions.telnet] INFO: Telnet Password: adb2ef029ca73dfb
2020-10-06 06:48:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.logstats.LogStats']
2020-10-06 06:48:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','zhaobiao.middlewares.ZhaobiaoDownloaderMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-06 06:48:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-06 06:48:51 [scrapy.middleware] INFO: Enabled item pipelines:
['zhaobiao.pipelines.ZhaobiaoPipeline']
2020-10-06 06:48:51 [scrapy.core.engine] INFO: Spider opened
2020-10-06 06:48:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-06 06:48:51 [bilian] INFO: Spider opened: bilian
2020-10-06 06:48:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-06 06:48:51 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ss.ebnew.com/tradingSearch/index.htm> (referer: None)
2020-10-06 06:48:51 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-06 06:48:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 783,'downloader/request_count': 1,'downloader/request_method_count/POST': 1,'downloader/response_bytes': 14229,'downloader/response_count': 1,'downloader/response_status_count/200': 1,'elapsed_time_seconds': 0.347671,'finish_reason': 'finished','finish_time': datetime.datetime(2020, 10, 5, 22, 48, 51, 602575),'log_count/DEBUG': 1,'log_count/INFO': 11,'response_received_count': 1,'scheduler/dequeued': 1,'scheduler/dequeued/memory': 1,'scheduler/enqueued': 1,'scheduler/enqueued/memory': 1,'start_time': datetime.datetime(2020, 10, 5, 22, 48, 51, 254904)}
2020-10-06 06:48:51 [scrapy.core.engine] INFO: Spider closed (finished)Process finished with exit code 0

No change in the result. So where exactly is the problem? Is the regular expression wrong? Even a failed match would make findall return an empty list, yet there is no output whatsoever — as if the code were never called at all.
So, has the spider been blocked? (After all, I was not using proxy IPs.)
A bit frustrating…

To track it down, go back to the source and look at the status code in the run output:

2020-10-06 06:48:51 [bilian] INFO: Spider opened: bilian
2020-10-06 06:48:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-06 06:48:51 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ss.ebnew.com/tradingSearch/index.htm> (referer: None)
2020-10-06 06:48:51 [scrapy.core.engine] INFO: Closing spider (finished)

DEBUG shows 200, so the request itself went through. But look closely: it is the POST request for the first (list) page.
Now look at how I defined parse_page2:

def parse_page2(self, response):

This method only runs when some request names it as its callback,

    def start_requests(self):
        from_data = self.form_data
        from_data['key'] = '路由器'
        from_data['currentPage'] = '2'
        yield scrapy.FormRequest(
            url='https://ss.ebnew.com/tradingSearch/index.htm',
            formdata=from_data,
            callback=self.parse_page1,
        )

but my callback actually points at parse_page1 (the handler for the previous, list page) — so parse_page2 is never invoked.
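The dispatch rule is easy to reproduce outside Scrapy: the engine calls whatever function the request object carries, so a handler that no request names simply never runs. A pure-Python sketch (hypothetical names, not Scrapy's real API):

```python
# Minimal model of callback dispatch: each "request" carries the handler
# that will receive its "response".
def run_engine(requests):
    results = []
    for url, callback in requests:
        response = f'<html from {url}>'   # stand-in for a downloaded page
        results.append(callback(response))
    return results

def parse_page1(response):
    return 'page1 handled ' + response

def parse_page2(response):
    return 'page2 handled ' + response

# Only parse_page1 is ever registered, so parse_page2 never fires.
out = run_engine([('https://ss.ebnew.com/tradingSearch/index.htm', parse_page1)])
print(out)
# ['page1 handled <html from https://ss.ebnew.com/tradingSearch/index.htm>']
```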

So, for a quick test, define another start_requests: copy the previous one, comment out the form_data lines, and point the callback at parse_page2:

    def start_requests(self):
        # no form data is needed for the detail page, so the FormRequest becomes
        # a plain GET Request (scrapy.Request takes no formdata argument);
        # the detail-page URL below is taken from the run log
        yield scrapy.Request(
            url='https://www.ebnew.com/businessShow/653378328.html',
            callback=self.parse_page2,
        )

Run it again:

D:\Python3.8.5\python.exe D:/zhaobiao/zhaobiao/spiders/start.py
2020-10-06 11:13:26 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: zhaobiao)
2020-10-06 11:13:26 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-06 11:13:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-06 11:13:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'zhaobiao','DOWNLOAD_DELAY': 3,'NEWSPIDER_MODULE': 'zhaobiao.spiders','SPIDER_MODULES': ['zhaobiao.spiders']}
2020-10-06 11:13:27 [scrapy.extensions.telnet] INFO: Telnet Password: c68f138b1a7f0a2d
2020-10-06 11:13:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.logstats.LogStats']
2020-10-06 11:13:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','zhaobiao.middlewares.ZhaobiaoDownloaderMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-06 11:13:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-06 11:13:30 [scrapy.middleware] INFO: Enabled item pipelines:
['zhaobiao.pipelines.ZhaobiaoPipeline']
2020-10-06 11:13:30 [scrapy.core.engine] INFO: Spider opened
2020-10-06 11:13:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-06 11:13:30 [bilian] INFO: Spider opened: bilian
2020-10-06 11:13:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-06 11:13:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ebnew.com/businessShow/653378328.html> (referer: None)
HF20200930144955483001 ;网络设备;电工仪表;
2020-10-06 11:13:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-06 11:13:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,'downloader/request_count': 1,'downloader/request_method_count/GET': 1,'downloader/response_bytes': 9957,'downloader/response_count': 1,'downloader/response_status_count/200': 1,'elapsed_time_seconds': 0.559952,'finish_reason': 'finished','finish_time': datetime.datetime(2020, 10, 6, 3, 13, 30, 676786),'log_count/DEBUG': 1,'log_count/INFO': 11,'response_received_count': 1,'scheduler/dequeued': 1,'scheduler/dequeued/memory': 1,'scheduler/enqueued': 1,'scheduler/enqueued/memory': 1,'start_time': datetime.datetime(2020, 10, 6, 3, 13, 30, 116834)}
2020-10-06 11:13:30 [scrapy.core.engine] INFO: Spider closed (finished)Process finished with exit code 0

The project code was extracted: HF20200930144955483001
So was the industry: 网络设备;电工仪表; (network equipment; electrical instruments)

Problems drive progress: solving one does not just mean mastering that corner of the field, it also surfaces the gaps in how you analyze problems.
