I. Requirements
Scrape the news listed on the front page of the Sina News China channel (http://news.sina.com.cn/china/).
Press F12 to open the developer tools and inspect the page source.
Open each news link and extract the article's title, content, and source.
Press F12 again on the article page to inspect its source.
II. Implementation
Step 1: Build a function that scrapes one article
import requests
from bs4 import BeautifulSoup

def getArticle(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    dic = {}  # collect the fields as key-value pairs
    dic['title'] = soup.select('.main-title')[0].text
    dic['content'] = ''.join(soup.select('#article')[0].text.split())  # strip all whitespace
    dic['source'] = soup.select('.date-source')[0].text
    #dic['keyword'] = soup.select('#keywords')[0].text
    return dic
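To see what the CSS selectors in the function return, here is a sketch run against a minimal hand-made HTML snippet that mimics the assumed layout of a Sina article page (the live page structure may differ or have changed since this was written):

```python
from bs4 import BeautifulSoup

# Hand-made snippet mimicking the assumed article layout
html = '''
<h1 class="main-title">Sample title</h1>
<div class="date-source">2018年01月01日 12:00 来源站点</div>
<div id="article"><p>First  paragraph.</p><p>Second paragraph.</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')
title = soup.select('.main-title')[0].text
# split() breaks on any whitespace, so join('') removes it all,
# which is fine for Chinese text that has no word spaces
content = ''.join(soup.select('#article')[0].text.split())
print(title)    # Sample title
print(content)  # Firstparagraph.Secondparagraph.
```

Note that the whitespace-stripping trick also removes spaces between English words; for mixed-language articles `' '.join(...)` may be preferable.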
Step 2: Collect the article URLs from the front page
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
newsary = []
for link in soup.select('.news-1 li, .news-2 li'):
    # each <li> holds one link; pass its href to the scraping function
    #print(link.select('a')[0]['href'])
    newsary.append(getArticle(link.select('a')[0]['href']))
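The selector `'.news-1 li, .news-2 li'` is a CSS union: it matches every `<li>` under either class. A small sketch against made-up markup (the class names mirror the front page as described above; the URLs are illustrative only):

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the assumed front-page list structure
html = '''
<ul class="news-1"><li><a href="http://example.com/a.shtml">A</a></li></ul>
<ul class="news-2"><li><a href="http://example.com/b.shtml">B</a></li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')
# the comma in the selector unions both lists, in document order
urls = [li.select('a')[0]['href'] for li in soup.select('.news-1 li, .news-2 li')]
print(urls)  # ['http://example.com/a.shtml', 'http://example.com/b.shtml']
```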
Step 3: Assemble a table and clean it up
1. Build the DataFrame
import pandas
df = pandas.DataFrame(newsary)
df.head()
2. Clean the data
# split the source column into two with a regular expression
# (use a raw string so the backslashes are not interpreted as escapes)
df[['datetime', 'from']] = df['source'].str.extract(r'\n(\d+年\d+月\d+日\s\d+:\d+)\n(\w+)', expand=False)
# convert the string column to a proper datetime type
df['datetime'] = pandas.to_datetime(df['datetime'], format='%Y年%m月%d日 %H:%M')
# drop the now-redundant column
del df['source']
df
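The extract-and-convert step can be tried end to end on toy data. The sample string below imitates the assumed format of the `source` column (date, time, then the outlet name, separated by newlines); real values come from the `.date-source` element and may vary:

```python
import pandas as pd

# Toy data imitating the assumed format of the scraped 'source' column
df = pd.DataFrame({'source': ['\n2018年01月01日 12:30\n新华网']})
# two capture groups -> two new columns, assigned positionally
df[['datetime', 'from']] = df['source'].str.extract(
    r'\n(\d+年\d+月\d+日\s\d+:\d+)\n(\w+)', expand=False)
# the format string must spell out the Chinese date markers
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y年%m月%d日 %H:%M')
del df['source']
print(df)
```

If any row fails to match the pattern, `str.extract` fills that row with NaN rather than raising, so it is worth checking for NaN before trusting the result.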
3. Save to a file
df.to_excel('news.xlsx')
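`to_excel` relies on an Excel engine such as openpyxl being installed. A dependency-free alternative, sketched here on a one-row stand-in DataFrame, is CSV output; the `utf-8-sig` encoding is an optional choice that keeps Chinese text readable when the file is opened in Excel:

```python
import pandas as pd

# One-row stand-in for the scraped table, for illustration only
df = pd.DataFrame([{'title': 'demo', 'content': 'text', 'from': '新华网'}])
# utf-8-sig writes a BOM so Excel detects the encoding correctly
df.to_csv('news.csv', index=False, encoding='utf-8-sig')
```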