How to Build Your Own Web Crawler on an Ubuntu VPS

How to Build Your Own Web Crawler on an Ubuntu VPS - netflix - 09-11-2023

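If you want to learn how to build your own web crawler on a VPS, consider using Scrapy. This post walks through the basic features of the Scrapy web crawling framework.
Scrapy is an open-source framework, written in Python, for extracting data from websites. It lets your VPS run crawling jobs in a fast, simple, and extensible way.
How to install Scrapy on Ubuntu
Scrapy depends on Python, the Python development libraries, and pip.
A recent version of Python 3 should already be installed on your Ubuntu VPS, so we only need to install pip and the Python development libraries before installing Scrapy.
Before continuing, let's make sure the system is up to date. Log in and obtain root privileges with the following command:

Code: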
> sudo -i

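We can now refresh the package index and make sure Python is installed with the following two commands. On current Ubuntu releases the package is named python3:

Code:
> apt-get update
> apt-get install python3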

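In the next step we install pip. pip is Python's package installer and the replacement for easy_install. It installs and manages packages from the Python Package Index (PyPI). On current Ubuntu releases it can be installed with the following command:

Code:
> apt-get install python3-pip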

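After installing pip, install the Python development libraries with the following command (the package is named python3-dev on current Ubuntu releases):

Code:
> apt-get install python3-dev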

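If this package is missing, installing Scrapy can fail with an error about the Python.h header file. Check the output of the previous command before continuing with the next steps.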
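To double-check the point about Python.h, the following minimal sketch asks the interpreter where it expects its C headers and reports whether Python.h is present there. The function names are illustrative and not part of Scrapy or pip, and the check only covers the interpreter that runs the script.

```python
import os
import sys
import sysconfig


def python_header_path():
    """Return the path where this interpreter expects Python.h to live."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def check_build_prerequisites():
    """Collect a few simple facts about the local Python build environment."""
    header = python_header_path()
    return {
        "python_version": "%d.%d" % sys.version_info[:2],
        "header_path": header,
        "header_present": os.path.isfile(header),
    }


if __name__ == "__main__":
    for key, value in check_build_prerequisites().items():
        print("%s: %s" % (key, value))
```

If header_present is reported as False, the development package is probably not installed for that Python version.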
The Scrapy framework can be installed from the Python Package Index with pip. Run the following command:

Code:
> pip install scrapy

The installation takes some time and should end with a message similar to the following:

Quote: “Successfully installed scrapy queuelib service-identity parsel w3lib PyDispatcher cssselect Twisted pyasn1 pyasn1-modules attrs constantly incremental
Cleaning up...”

If you see this message, Scrapy has been installed successfully and you can start crawling!
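If the pip output has already scrolled away, one way to confirm the result is to ask Python which version of the package is installed. This is a small sketch using the standard library's importlib.metadata. The helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("Scrapy is not installed for this interpreter")
    else:
        print("Scrapy %s is installed" % version)
```

Run it with the same Python interpreter that pip installed into, otherwise the answer may not match.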
Before you start scraping, you need to set up a new Scrapy project. Change into the directory where you want to store your code and run:

Code:
> scrapy startproject myProject

This creates a “myProject” directory with the following contents:

Code:
myProject/
    scrapy.cfg        - the project configuration file
    myProject/        - the project's Python module; you'll import your code from here
        items.py      - project items definition file
        pipelines.py  - project pipelines file
        settings.py   - project settings file
        spiders/      - a directory where you'll later put your spiders
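As a quick sanity check of the layout listed above, the sketch below reports which of those entries are missing under a project directory. The list of paths is copied from the listing. The helper name missing_entries is hypothetical and not a Scrapy API.

```python
from pathlib import Path

# Paths relative to the project root, taken from the listing above.
EXPECTED = [
    "scrapy.cfg",
    "myProject/items.py",
    "myProject/pipelines.py",
    "myProject/settings.py",
    "myProject/spiders",
]


def missing_entries(project_root, expected=EXPECTED):
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("myProject")
    print("missing:", missing if missing else "nothing")
```

Run it from the directory in which you executed scrapy startproject.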

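We will now create our first spider and run it to collect some information from the web.
Spiders are classes that you define. Scrapy uses them to scrape information from a website (or a group of websites). Here is the code for our first spider. Save it in a file named “quotes_spider.py” under the “myProject/spiders” directory of the project: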

Code:
import scrapy


class QuotesSpider(scrapy.Spider):
    # The name used on the command line: scrapy crawl quotes
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends in ".../page/<n>/", so element [-2] is the page number
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

This code visits the following two pages, which contain quotes from various authors, and saves them as HTML files named quotes-1.html and quotes-2.html:
http://quotes.toscrape.com/page/1/
http://quotes.toscrape.com/page/2/
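The file names come from the expression response.url.split("/")[-2] in the spider's parse method. The standalone sketch below reproduces only that string handling, without Scrapy. The function name page_filename is an illustrative helper and not part of the spider.

```python
def page_filename(url):
    """Build the output file name the same way the spider's parse() does."""
    # "http://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


if __name__ == "__main__":
    for url in ("http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/"):
        print(url, "->", page_filename(url))
```

The approach relies on the trailing slash: for a URL without it, such as http://quotes.toscrape.com/page/1, index -2 is "page" and the result would be quotes-page.html.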
Once you have saved the file containing the code, you can run your first spider with the following two commands:

Code:
> cd myProject

> scrapy crawl quotes

The spider's run should end with the following line:

Quote: “…..[scrapy] INFO: Spider closed (finished)”

If you list the files in the current directory, you should see the new HTML files generated by the spider:

Quote: quotes-1.html
quotes-2.html

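In the next example, we extract information about each author by following the links to their pages, and save the results in a file in JSON Lines format. First, create a new spider named author_spider.py with the following contents: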

Code:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # default='' avoids calling .strip() on None when nothing matches
            return response.css(query).extract_first(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
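The spider does two things with each link: it turns a relative href into an absolute URL, and it follows author and pagination links that may point to pages already seen. The sketch below imitates that pattern with the standard library only. It assumes that resolving a link behaves much like urllib.parse.urljoin applied to the page URL. FAKE_SITE is a made-up in-memory site, and the breadth-first order is chosen for simplicity and says nothing about how Scrapy schedules requests.

```python
from urllib.parse import urljoin

# Hypothetical miniature site: page URL -> hrefs found on that page.
FAKE_SITE = {
    "http://example.com/": ["/author/A/", "/page/2/"],
    "http://example.com/page/2/": ["/author/A/", "/author/B/"],
    "http://example.com/author/A/": [],
    "http://example.com/author/B/": [],
}


def crawl(start_url, site):
    """Visit every reachable page once, in breadth-first order."""
    seen = {start_url}
    queue = [start_url]
    visited = []
    while queue:
        url = queue.pop(0)
        visited.append(url)
        for href in site.get(url, []):
            absolute = urljoin(url, href)  # relative href -> absolute URL
            if absolute not in seen:       # skip pages already scheduled
                seen.add(absolute)
                queue.append(absolute)
    return visited


if __name__ == "__main__":
    for page in crawl("http://example.com/", FAKE_SITE):
        print(page)
```

Author A is linked from two pages but visited only once. Scrapy filters duplicate requests by default in a similar spirit, which is why the spider can yield the same author link from several quote pages.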

We can now run this new spider with the following command:

Code:
> scrapy crawl author -o author.jl

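This creates a file named author.jl containing the extracted data. The JSON Lines format is useful because it is stream-like: each line is a standalone JSON object, so you can easily append new records to the file.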
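To illustrate why JSON Lines is easy to append to and to read back, here is a minimal sketch using only the standard json module. The helper names append_record and read_records are invented for this example. The field names name and birthdate match the dictionary yielded by the author spider.

```python
import json
import os


def append_record(path, record):
    """Append one record to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    # Print a short summary if the spider's output file is present.
    if os.path.exists("author.jl"):
        for author in read_records("author.jl"):
            print(author["name"], "-", author["birthdate"])
```

Because every record sits on its own line, the file can be processed one line at a time without loading it all into memory.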
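This is only a brief overview of Scrapy. You can use it on your Ubuntu VPS to carry out much more complex tasks.
If you want to learn more about Scrapy, the best approach is to dig into the Scrapy documentation.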