Scrapy和Django实现蚌埠医学院手机新闻网站制作

news/2024/5/20 23:54:10/文章来源:https://blog.csdn.net/weixin_34253126/article/details/89587187

最终效果（不看效果就讲过程都是耍流氓）：
列表页、详情页展示

实现过程如下:

框架：

Scrapy：数据采集
Django：数据呈现

目标网站：
蚌埠医学院学院新闻列表:http://www.bbmc.edu.cn/index.php/view/viewcate/0/

蚌埠医学院学院新闻列表截图

第一步：数据抓取

新建爬虫项目

在终端中执行命令

srapy startproject bynews

执行完毕，自动新建好项目文件
爬虫项目目录

编写爬虫代码

在爬虫目录spider中，新建具体爬虫的by_spider.py文件：

by_spider文件

爬虫代码功能说明：

新闻列表自动翻下一页，直到结束，每个新闻列表页提取进如新闻详情页的url
逐个新闻详情页面进入，提取新闻名称，发文机构，发文时间，新闻内容

爬虫源代码内容：

import scrapy
from bynews.items import BynewsItem
import re
class BynewsSpider(scrapy.Spider):name = 'news'start_urls = ['http://www.bbmc.edu.cn/index.php/view/viewcate/0/']allowed_domains = ['bbmc.edu.cn']def parse(self, response):news = response.xpath("//div[@class='cate_body']//ul/li")next_page = response.xpath("//div[@class='pagination']/a[contains(text(),'>')]/@href").extract_first()href_urls = response.xpath("//div[@class='cate_body']/ul/li/a/@href").extract()for href in href_urls:yield scrapy.Request(href,callback=self.parse_item)if next_page is not None:yield scrapy.Request(next_page)def parse_item(self, response):item = BynewsItem() item['title'] = response.xpath("//div[@id='news_main']/div/h1/text()").extract_first()# 获取 作者和发文日期string = response.xpath("//div[@class='title_head']/span/text()").extract_first()# 通过分割获取文章发布日期full_date = string.split('/')year = full_date[0][-4:]month = full_date[1]day = full_date[2]item['post_date'] = year + month +day# 通过正则表达式获取作者matchObj = re.search(r'\[.*\]',string)# 去除两边的中括号string = matchObj.group()string = string[1:]string = string[:-1]item['author'] = stringcontents = response.xpath("//div[@id='news_main']/div/div/span/text()").extract_first()text = ''for content in contents :text = text + contentitem['content'] = textyield item

提取的数据结构items.py文件：

import scrapy
class BynewsItem(scrapy.Item):# define the fields for your item here like:# name = scrapy.Field()title = scrapy.Field()author = scrapy.Field()post_date = scrapy.Field()content = scrapy.Field()

爬虫项目的settings.py文件，增加请求头，关闭爬虫协议，仅放修改的部分，其他的默认即可：

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8','Accept-Language': 'en',
}

保存爬取数据

准备工作就绪后，执行scrapy crawl news -o news.csv命令开始抓取数据，并且报存到文件news.csv中，方便后续建站导入数据：
爬虫抓取数据过程图

采集好的数据：
csv文件图

爬虫过程结束，准备建站啦~

第二步：数据呈现

新建项目

在终端中执行命令：django-admin startproject 项目名

新建应用

执行命令：python manage.py startapp bynews新建应用
Django目录截图

Django采用MVC架构，需要写的内容比较多，就不一一放图了，直接甩上代码：

项目目录下

urls.py文件：

from django.contrib import admin
from django.urls import include,path
urlpatterns = [path('admin/', admin.site.urls),path('blog/', include('blog.urls')),# 蚌医新闻path('bynews/',include('bynews.urls')),
]

settings.py文件(仅放修改部分)：

INSTALLED_APPS = ['blog.apps.BlogConfig','bynews.apps.BynewsConfig','django.contrib.admin','django.contrib.auth','django.contrib.contenttypes','django.contrib.sessions','django.contrib.messages','django.contrib.staticfiles',
]

bynews应用目录下：

模型models.py文件(模型文件弄好后，需要执行数据迁移什么的命令，Django保证了所有操作都是基于面向对象，十分强大)：

from django.db import models
class Bynews(models.Model):title = models.CharField(max_length=30, default='Title')author = models.CharField(null = True,max_length=30)content= models.TextField(null = True)def __str__(self):return self.title

视图views.py文件：

from django.shortcuts import render,get_object_or_404
from .models import Bynewsdef index(request):news = Bynews.objects.order_by('id')[:30]return render(request, 'bynews/newslist.html',{'news_list':news})def detail(request,news_id):news = get_object_or_404(Bynews,pk=news_id)return render(request, 'bynews/detail.html', {'news':news})

urls.py文件：

from django.urls import path 
from . import viewsapp_name = 'bynews'urlpatterns = [# ex:/bynews/path('', views.index, name='index'),# ex:/bynews/4/path('<int:news_id>/', views.detail, name = 'detail'),#path('<int:news_id>/vote/', views.form ,name='form'),
]

在应用目录下新建template文件夹用来存放模板文件：

放上一个新闻列表页的模板newslist文件，需要注意其中的url生成方式，static的生成方式:

{% load static %}
<!DOCTYPE html>
<html lang="en" class="app">
<head><meta charset="utf-8" /><meta name="viewport" content="initial-scale=1.0, maximum-scale=1.0, user-scalable=no" /><meta name="description" content="" /><meta name="keywords" content="" /><link rel="stylesheet" type="text/css" href="{% static 'bynews/css/style.css' %}"><title>蚌埠医学院 - 新闻列表</title>
</head>
<body><header class="header"><h1>蚌埠医学院</h1><h2>新闻列表</h2></header><ul class="list">{% for news in news_list %}<li><a href="default.htm"><i class="fl"></i><span class="fl"><a href="{% url 'bynews:detail' news.id %}" >{{news.title}}</a></span><em class="fl"> </em></a></li>{% endfor %}</ul><script src="{% static 'bynews/js/vue.js' %}" type="text/javascript"></script><script></script>
</body>
</html>

最终效果(请忽视正文内容不全问题，这是采集没弄好，懒得弄了，基本想法实现了就行)：

列表页、详情页展示

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：http://www.luyixian.cn/news_show_786741.aspx

如若内容造成侵权/违法违规/事实不符，请联系dt猫网进行投诉反馈email:809451989@qq.com，一经查实，立即删除！