T-ACG - 不熬夜吧 - Python Crawler System: Scraping a Novel Website - A Simple Crawler Script in 20 Lines of Code

A Simple Crawler Script in 20 Lines of Code

A Simple Fetch

main.py

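import requests
# fetch the page; the timeout keeps a slow server from hanging the script
response = requests.get("https://www.bing.com/", timeout=10)
print(response.status_code)  # 200 means the request succeeded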
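
A bare `requests.get` call does not report HTTP errors on its own, so a crawler usually wraps it. The sketch below is one possible way to do that, not part of the original script: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an assumed default.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper, not part of the requests library.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() turns 4xx/5xx responses into exceptions
        response.raise_for_status()
    except requests.RequestException as error:
        # covers invalid URLs, connection errors, timeouts and HTTP errors
        print(f"request failed: {error}")
        return None
    return response.text
```

Calling `fetch_html("https://www.bing.com/")` returns the HTML as a string when the request succeeds, and `None` otherwise, so the caller can skip a failed page instead of crashing.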

Fetching and Parsing the HTML Structure

main.py

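import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://www.xbiqugew.com"

# initialize the list of discovered urls
# with the first page to visit
urls = [BASE_URL]
# remember every url already queued so no page is fetched twice
seen = set(urls)

# until all discovered pages have been visited
while urls:
	# get the page to visit from the list
	current_url = urls.pop()

	# crawling logic
	response = requests.get(current_url, timeout=10)
	soup = BeautifulSoup(response.content, "html.parser")

	# collect same-site links, resolving relative hrefs against the current page
	for link_element in soup.select("a[href]"):
		url = urljoin(current_url, link_element["href"])
		if url.startswith(BASE_URL) and url not in seen:
			seen.add(url)
			urls.append(url)
	# show the pages still waiting to be visited
	print(urls)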
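
Deciding which links belong to the site is the part of this loop that is easiest to get wrong: relative links carry no domain, `#fragment` suffixes make one page look like several, and `javascript:` or `mailto:` links are not pages at all. The following is a minimal sketch of that filtering step using only the standard library. `same_site_links` is a hypothetical helper introduced here for illustration, and comparing hosts exactly is an assumption about what "same site" should mean.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_site_links(page_url, hrefs, site_root):
    """Resolve hrefs found on page_url and keep those on site_root's host.

    same_site_links is a hypothetical helper used for illustration only.
    """
    site_host = urlparse(site_root).netloc
    result = []
    seen = set()
    for href in hrefs:
        # resolve relative links such as "/book/2/" against the current page
        absolute = urljoin(page_url, href)
        # drop "#fragment" so one page is not queued under several names
        absolute, _fragment = urldefrag(absolute)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip javascript:, mailto: and similar
        if parsed.netloc != site_host:
            continue  # skip links that leave the site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

In the crawler, the `href` values collected from `soup.select("a[href]")` could be passed through such a helper before being appended to `urls`.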
