Notes on Crawling an Entire Website

news/2024/5/8 23:46:37/Source: https://blog.csdn.net/weixin_34293246/article/details/92055399


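For one reason or another, we often need to crawl a website or simply copy an entire site. I tested many tools found online, and each had its own problems. In the end I settled on Teleport Ultra, which works well. I won't go through the operating manual here, since a quick search turns up plenty of guides; this post is mainly about the problems I ran into.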

Download: http://download.csdn.net/detail/ityouknow/9506423

[Screenshot of the tool]

The site used for the test crawl was 简单心理 (Jiandan Xinli): www.jiandanxinli.com

[Screenshot of the result after crawling]

 

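I usually set the copy depth to 100 levels, which pulls down essentially the whole site. However, Teleport Ultra crawls using UTF-8 encoding, so if a file contains Chinese characters, or is GBK-encoded, the text shows up garbled, as in this screenshot:

[Screenshot of a page with garbled text]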

 

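Manually selecting UTF-8 in the browser also works, but we can't do that every time we open a page. So I went looking online and found a tool called TelePort乱码修复工具 (a garbled-text repair tool for Teleport, siteRepair-v2.0). In my tests it fixes the garbled text, and it also cleans out some invalid links, stray HTML symbols, and the like.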

Download: http://download.csdn.net/detail/ityouknow/9506429

[Screenshot of siteRepair-v2.0]
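I don't know how siteRepair-v2.0 works internally, so the following is only a minimal sketch of the general idea behind this kind of repair: bring every downloaded page to one encoding. The helper names `toUtf8` and `normalizeFile` are hypothetical, and the sketch assumes that any file which is not valid UTF-8 is GBK.

```php
<?php
// Hypothetical helper (not part of Teleport Ultra or siteRepair):
// return $bytes as UTF-8, assuming anything that is not valid UTF-8 is GBK.
function toUtf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'GBK');
}

// Hypothetical helper: rewrite one downloaded page in place if it changed.
function normalizeFile(string $path): void
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        return; // unreadable file, skip it
    }
    $fixed = toUtf8($bytes);
    if ($fixed !== $bytes) {
        file_put_contents($path, $fixed);
    }
}
```

This only handles the bytes of the file. It does not touch any charset declaration inside the HTML, and the GBK assumption will be wrong for sites in other legacy encodings, so try it on a copy first.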

 

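After these two steps the vast majority of sites should be fine. However, some sites use Chinese directory names or Chinese file names in their hierarchy, and those come out garbled. A URL like the following is an example: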

http://www.xxxx.com/.com/question/除了加锁,还有什么方法解决资源竞争的问题?/解决方案.html

Crawling a site structured like this produces two kinds of garbling: 1) garbled folder names 2) garbled file names

When it runs into this, siteRepair-v2.0 reports an error; my guess is that it cannot recognize the garbled folders or files.

 

Later I found a PHP program online, and after some simple modifications my tests showed that it solves this problem.

PHP code: convert.php

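<?php
// Recursively rename every entry under $dir, converting its name from UTF-8 to GBK.
function listDir($dir)
{
    $dir = rtrim($dir, "/");
    if (!is_dir($dir)) {
        return;
    }
    // scandir() reads the whole listing up front, so renaming inside the
    // loop cannot disturb the directory iteration.
    foreach (scandir($dir) ?: [] as $file) {
        if ($file == "." || $file == "..") {
            continue;
        }
        $old = $dir . "/" . $file;
        $converted = mb_convert_encoding($file, "GBK", "UTF-8");
        if (is_dir($old)) {
            $new = $dir . "/" . $converted;
            // Recurse into the renamed directory; after a successful rename
            // the old path no longer exists.
            listDir(rename($old, $new) ? $new : $old);
        } else {
            // As in the original script, backslashes are stripped from file names.
            $converted = str_replace('\\', '', $converted);
            rename($old, $dir . "/" . $converted);
            echo 'Path: ' . $old . '<br />';
            echo 'Result: ' . $converted . '<br />';
        }
    }
}

// Start the run
listDir("./convert");
?>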

 

In the same directory as the script, create a folder named convert, put the garbled files into it, and then run convert.php.
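Because convert.php renames files in place, it may be worth previewing the changes first. The following is a minimal dry-run sketch, not part of the original script: `previewRenames` is a hypothetical helper that applies the same UTF-8 to GBK name conversion but only reports what would change.

```php
<?php
// Hypothetical dry-run helper: it only reports what would be renamed.
// Returns an array mapping each current path to the proposed new name.
function previewRenames(string $dir): array
{
    $plan = [];
    $entries = is_dir($dir) ? scandir($dir) : false;
    if ($entries === false) {
        return $plan; // missing or unreadable directory: nothing to report
    }
    foreach ($entries as $name) {
        if ($name === '.' || $name === '..') {
            continue;
        }
        $path = rtrim($dir, '/') . '/' . $name;
        $converted = mb_convert_encoding($name, 'GBK', 'UTF-8');
        if ($converted !== $name) {
            $plan[$path] = $converted; // pure-ASCII names are unchanged and skipped
        }
        if (is_dir($path)) {
            $plan += previewRenames($path); // descend without renaming anything
        }
    }
    return $plan;
}

// Print the plan for the same folder that convert.php works on.
foreach (previewRenames('./convert') as $path => $newName) {
    echo $path, ' -> ', $newName, PHP_EOL;
}
```

The proposed names are GBK byte strings, so they will only display correctly in a console or page that is itself using GBK. Nothing on disk is modified by this sketch.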

Reposted from: https://my.oschina.net/ityouknow/blog/876802
