1. What a crawler is for
Web crawling automates repetitive manual workflows.
2. Preparing to crawl

a. Check robots.txt
Append /robots.txt to the site's root URL to see whether the site sets any requirements or restrictions for crawlers.
    User-agent: names the crawler that the rules below it apply to; combined with Disallow rules, this is how a site bans a particular user agent.
    Crawl-delay: gives the delay the site asks crawlers to leave between requests.
    Sitemap: gives the link to the site's sitemap file.
Example: the sitemap provided by Jobbole (伯乐在线).

b. Estimate the size of the site
Search for site: followed by the site's domain or a URL path (use Google). The number of results gives a rough estimate of how many pages the site has.

c. Identify the technology the site uses
i. In Windows PowerShell, run pip to check whether pip is installed.
ii. Install the builtwith module with pip install builtwith.
iii. Pass a URL to the module to analyze it:

    >>> import builtwith
    >>> builtwith.parse('http://example.webscraping.com')
    {u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'], u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'], u'programming-languages': [u'Python'], u'web-servers': [u'Nginx']}
    >>> builtwith.parse('http://jianshu.com')
    {u'javascript-frameworks': [u'Prototype', u'RequireJS'], u'web-frameworks': [u'Twitter Bootstrap', u'Ruby on Rails'], u'programming-languages': [u'Ruby'], u'web-servers': [u'Tengine']}
    >>> builtwith.parse('http://chinadaily.com.cn')
    {u'javascript-frameworks': [u'jQuery'], u'web-servers': [u'Nginx']}
    >>> builtwith.parse('http://oschina.net')
    {u'javascript-frameworks': [u'jQuery', u'Vue.js'], u'web-servers': [u'Tengine']}

d. Find the site's owner
i. Install the WHOIS protocol wrapper library with pip install python-whois.
ii. Usage:

    >>> import whois
    >>> print(whois.whois('jianshu.com'))
    {
      "updated_date": ["2016-04-06 00:00:00", "2016-04-06 10:24:47"],
      "status": ["clientTransferProhibited https://icann.org/epp#clientTransferProhibited", "clientTransferProhibited"],
      "name": "Shanghai Bai Ji Information Technology Inc. Ltd,",
      "dnssec": "Unsigned",
      "city": "Shanghai",
      "expiration_date": ["2020-03-20 00:00:00", "2020-03-20 18:28:58"],
      "zipcode": "200433",
      "domain_name": "JIANSHU.COM",
      "country": "CN",
      "whois_server": "whois.name.com",
      "state": "Shanghai",
      "registrar": "Name.com, Inc.",
      "referral_url": "http://www.name.com",
      "address": "Innospace 2, B1, Building #5, KIC, No.316 Songhu Road , Yangpu District",
      "name_servers": ["F1G1NS1.DNSPOD.NET", "F1G1NS2.DNSPOD.NET", "f1g1ns1.dnspod.net", "f1g1ns2.dnspod.net"],
      "org": "Shanghai Bai Ji Information Technology Inc. Ltd,",
      "creation_date": ["2008-03-20 00:00:00", "2008-03-20 18:28:58"],
      "emails": ["contact@jianshu.com", "abuse@name.com"]
    }
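
The robots.txt check in step 2a can also be done in code rather than by reading the file by hand. Below is a minimal sketch using the standard library's urllib.robotparser on Python 3. The robots.txt content, the crawler names (BadCrawler, GoodCrawler) and the helper load_rules are made up for illustration and do not come from any real site. In a real crawl you would typically point the parser at the live file with set_url() and read() instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap

Sitemap: http://example.webscraping.com/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text and return a ready-to-query parser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
base = "http://example.webscraping.com"

# BadCrawler is banned from the whole site by its own rule block.
print(rules.can_fetch("BadCrawler", base + "/index"))   # False
# Any other agent falls under the "*" block: allowed, except for /trap.
print(rules.can_fetch("GoodCrawler", base + "/index"))  # True
print(rules.can_fetch("GoodCrawler", base + "/trap"))   # False
# Requested delay between requests for agents under the "*" block.
print(rules.crawl_delay("GoodCrawler"))                 # 5
# Sitemap URLs listed in the file (site_maps() needs Python 3.8 or later).
print(rules.site_maps())
```

Calling can_fetch() before each download, and sleeping for the value returned by crawl_delay() when it is not None, is one straightforward way to honor the restrictions described in step 2a.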