[Python Web Scraping CASE] Fetching a News Site's Article List with Selenium + BeautifulSoup

2024/5/9 20:52:15 / Article source: https://blog.csdn.net/weixin_40844116/article/details/108103653

I. Requirements

Fetch the news titles and links listed on the home page of Tencent News (https://news.qq.com/).
Press F12 to open the developer tools and inspect the page source.

II. Implementation

Step 1: Get the page source

If you fetch the source with the requests library:

import requests
res = requests.get('http://news.qq.com/')

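However, the page is rendered in the browser, so the source fetched this way does not match what you actually see there. The requests approach is therefore of no use here, and the webdriver from the Selenium library is needed instead: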

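from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://news.qq.com/')
# Option 1: run a JavaScript command
html = driver.execute_script("return document.documentElement.outerHTML")
# Option 2: locate the root element to get the whole HTML document
# (Selenium 4 removed find_element_by_xpath; use find_element instead)
# html = driver.find_element("xpath", "//*").get_attribute("outerHTML")
driver.quit()  # end the browser session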
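Reading the source immediately after driver.get can still miss content that scripts add a moment later. The sketch below polls until at least one element matches a CSS selector and only then returns driver.page_source. The helper name wait_for_html, the timeout values, and the '.detail' selector in the usage note are assumptions for illustration. Selenium also ships WebDriverWait for the same purpose; the hand-rolled loop is used here only to keep the sketch free of extra imports.

```python
import time


def wait_for_html(driver, css_selector, timeout=10.0, poll=0.5):
    """Return driver.page_source once css_selector matches at least one element.

    `driver` is any object exposing Selenium 4's find_elements(by, value)
    and page_source. Raises TimeoutError if nothing matches in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        if driver.find_elements("css selector", css_selector):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no element matched {css_selector!r} within {timeout}s"
            )
        time.sleep(poll)
```

With the driver created above, this could be called as html = wait_for_html(driver, '.detail') before the browser is shut down.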

Alternatively, download the page's source manually and load it as a string or a file object:

html=open(r'G:\temp files\qq.htm')
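Calling open() without an encoding falls back to the platform default, which may not match the saved page. The following is a minimal sketch that tries a few candidate encodings in order. The function name read_html and the candidate list (UTF-8, then GB18030) are assumptions; the original does not say how the saved file is encoded.

```python
from pathlib import Path


def read_html(path, encodings=("utf-8", "gb18030")):
    """Return the file's text, decoded with the first encoding that works."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

The returned string can be passed to BeautifulSoup in place of the file object, for example html = read_html(r'G:\temp files\qq.htm').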

Step 2: Select elements with CSS selectors and parse the data

from bs4 import BeautifulSoup
import pandas

# Pass in the page source
soup = BeautifulSoup(html, 'html.parser')

# Extract each article's title and link into a list of dicts
newsary = []
for news in soup.select('.detail'):  # assumed container class; adjust to the page's actual markup
    links = news.select('a')
    if links:
        newsary.append({'title': links[0].text, 'url': links[0]['href']})

# Build a DataFrame, save it, and display it
newsdf = pandas.DataFrame(newsary)
newsdf.to_excel(r'G:\temp files\qqnews.xlsx')
newsdf
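Scraped href values are often relative or protocol-relative, and the same article can appear more than once on a home page. The sketch below normalizes such a list before it goes into the DataFrame. The function name clean_news and the default base URL are assumptions; the input is expected to have the same title/url dict shape as newsary.

```python
from urllib.parse import urljoin


def clean_news(records, base_url="https://news.qq.com/"):
    """Strip titles, make URLs absolute, and drop duplicate URLs (first one wins)."""
    seen = set()
    cleaned = []
    for rec in records:
        title = rec["title"].strip()
        # urljoin resolves relative and protocol-relative links against base_url
        url = urljoin(base_url, rec["url"].strip())
        if not title or url in seen:
            continue  # skip empty titles and repeated links
        seen.add(url)
        cleaned.append({"title": title, "url": url})
    return cleaned
```

The table could then be built from the cleaned list with pandas.DataFrame(clean_news(newsary)).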