I. Requirements
Scrape the news listed on the front page of the Sina News China channel (http://news.sina.com.cn/china/).
Press F12 to open the developer tools and inspect the page source.
Open each news link and extract the article's title, content, and source.
Press F12 again on the article page to inspect its source.
II. Implementation
Step 1: Build a function that scrapes one article
import requests
from bs4 import BeautifulSoup

def getArticle(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    dic = {}  # collect the fields as key-value pairs
    dic['title'] = soup.select('.main-title')[0].text
    dic['content'] = ''.join(soup.select('#article')[0].text.split())  # strip all whitespace
    dic['source'] = soup.select('.date-source')[0].text
    #dic['keyword'] = soup.select('#keywords')[0].text
    return dic
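To see what the CSS selectors in the function return, here is a sketch run against a minimal hand-made HTML snippet that mimics the assumed layout of a Sina article page (the live page structure may differ or have changed since this was written):

```python
from bs4 import BeautifulSoup

# Hand-made snippet mimicking the assumed article layout
html = '''
<h1 class="main-title">Sample title</h1>
<div class="date-source">2018年01月01日 12:00 来源站点</div>
<div id="article"><p>First  paragraph.</p><p>Second paragraph.</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')
title = soup.select('.main-title')[0].text
# split() breaks on any whitespace, so join('') removes it all,
# which is fine for Chinese text that has no word spaces
content = ''.join(soup.select('#article')[0].text.split())
print(title)    # Sample title
print(content)  # Firstparagraph.Secondparagraph.
```

Note that the whitespace-stripping trick also removes spaces between English words; for mixed-language articles `' '.join(...)` may be preferable.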
Step 2: Collect the article URLs from the front page
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
newsary = []
for link in soup.select('.news-1 li, .news-2 li'):
    # each <li> holds one link; pass its href to the scraping function
    #print(link.select('a')[0]['href'])
    newsary.append(getArticle(link.select('a')[0]['href']))
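The selector `'.news-1 li, .news-2 li'` is a CSS union: it matches every `<li>` under either class. A small sketch against made-up markup (the class names mirror the front page as described above; the URLs are illustrative only):

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the assumed front-page list structure
html = '''
<ul class="news-1"><li><a href="http://example.com/a.shtml">A</a></li></ul>
<ul class="news-2"><li><a href="http://example.com/b.shtml">B</a></li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')
# the comma in the selector unions both lists, in document order
urls = [li.select('a')[0]['href'] for li in soup.select('.news-1 li, .news-2 li')]
print(urls)  # ['http://example.com/a.shtml', 'http://example.com/b.shtml']
```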
Step 3: Assemble a table and clean it up
1. Build the DataFrame
import pandas
df = pandas.DataFrame(newsary)
df.head()
2. Clean the data
# split the source column into two with a regular expression
# (use a raw string so the backslashes are not interpreted as escapes)
df[['datetime', 'from']] = df['source'].str.extract(r'\n(\d+年\d+月\d+日\s\d+:\d+)\n(\w+)', expand=False)
# convert the string column to a proper datetime type
df['datetime'] = pandas.to_datetime(df['datetime'], format='%Y年%m月%d日 %H:%M')
# drop the now-redundant column
del df['source']
df
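The extract-and-convert step can be tried end to end on toy data. The sample string below imitates the assumed format of the `source` column (date, time, then the outlet name, separated by newlines); real values come from the `.date-source` element and may vary:

```python
import pandas as pd

# Toy data imitating the assumed format of the scraped 'source' column
df = pd.DataFrame({'source': ['\n2018年01月01日 12:30\n新华网']})
# two capture groups -> two new columns, assigned positionally
df[['datetime', 'from']] = df['source'].str.extract(
    r'\n(\d+年\d+月\d+日\s\d+:\d+)\n(\w+)', expand=False)
# the format string must spell out the Chinese date markers
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y年%m月%d日 %H:%M')
del df['source']
print(df)
```

If any row fails to match the pattern, `str.extract` fills that row with NaN rather than raising, so it is worth checking for NaN before trusting the result.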
3. Save to a file
df.to_excel('news.xlsx')
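`to_excel` relies on an Excel engine such as openpyxl being installed. A dependency-free alternative, sketched here on a one-row stand-in DataFrame, is CSV output; the `utf-8-sig` encoding is an optional choice that keeps Chinese text readable when the file is opened in Excel:

```python
import pandas as pd

# One-row stand-in for the scraped table, for illustration only
df = pd.DataFrame([{'title': 'demo', 'content': 'text', 'from': '新华网'}])
# utf-8-sig writes a BOM so Excel detects the encoding correctly
df.to_csv('news.csv', index=False, encoding='utf-8-sig')
```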