网页抓取：抓取多个网址

原文：https://www.studytonight.com/python/web-scraping/scraping-multiple-urls

本教程只是指导你如何在多个网址上一起进行网页抓取，尽管你会在需要的时候想到它。

你们中的一些人可能已经猜到了，是的，我们将使用for循环。

我们将从创建一个数组来存储其中的网址开始，

## array holding url values
urls = ['https://xyz.com/page/1', 'https://xyz.com/page/2']

一个数组中可以有很多网址。不过，建议一句话，不要包含任何不必要的网址，因为每当我们对任何网址提出请求时，网站所有者都会因为向他们的服务器提出额外请求而付出代价。

永远不要在无限循环中运行网页抓取脚本

一旦你创建了一个数组，从头开始一个循环，并在循环中做所有的事情:

## importing bs4, requests, fake_useragent and csv modules
import bs4
import requests
from fake_useragent import UserAgent
import csv

## create an array with URLs
urls = ['https://xyz.com/page/1', 'https://xyz.com/page/2']

## initializing the UserAgent object
user_agent = UserAgent()

## starting the loop
for url in urls:
    ## getting the reponse from the page using get method of requests module
    page = requests.get(url, headers={"user-agent": user_agent.chrome})

    ## storing the content of the page in a variable
    html = page.content

    ## creating BeautifulSoup object
    soup = bs4.BeautifulSoup(html, "html.parser")

    ## Then parse the HTML, extract any data
    ## write it to a file

当你在一个脚本中运行多个 URL，并且也想将数据写入一个文件时，确保你以元组的形式存储数据，然后将其写入文件。

下一个教程是一个简单的练习，你必须在今晚的学习网站上运行网页抓取脚本。兴奋吗？前往下一个教程。