Scraping the Douban Top 250

Starting from a web-scraping assignment

Watch the course videos to get familiar with the basics: https://www.icourse163.org/learn/BIT-1001870001#/learn/announce (about two to three hours in total)

1 The requests library

Parameters

  • params: dict or byte sequence, appended to the URL as query parameters

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('GET', 'http://python123.io/ws', params=kv)
    print(r.url)
    # http://python123.io/ws?key1=value1&key2=value2
  • data: dict, byte sequence, or file object, sent as the request body

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('POST', 'http://python123.io/ws', data=kv)
    body = "body content"
    r = requests.request('POST', 'http://python123.io/ws', data=body)
  • json: data in JSON format, sent as the request body

    kv = {'key1': 'value1'}
    r = requests.request('POST', 'http://python123.io/ws', json=kv)
  • headers: dict of custom HTTP headers

    hd = {'user-agent': 'Chrome/10'}
    r = requests.request('POST', 'http://python123.io/ws', headers=hd)
  • cookies: dict or CookieJar, cookies attached to the request

  • auth: tuple, credentials for HTTP basic authentication (see the combined example after this list)

  • files: dict, for uploading files

    fs = {'file': open('data.xls', 'rb')}
    r = requests.request('POST', 'http://python123.io/ws', files=fs)
  • timeout: request timeout in seconds

    r = requests.request('GET', 'http://www.baidu.com', timeout=10)
  • proxies: dict mapping protocols to proxy servers; login credentials can be embedded in the URL

    pxs = {
        'http': 'http://user:pass@10.10.10.1:1234',
        'https': 'https://10.10.10.1:4321'
    }
    r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
  • allow_redirects: True/False, default True; whether to follow redirects

  • stream: True/False, default False; when False the response body is downloaded immediately, when True it is streamed on demand

  • verify: True/False, default True; whether to verify the server's SSL certificate

  • cert: path to a local SSL client certificate
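
    The flags above that have no snippet of their own can be combined in one call. A minimal sketch; the httpbin.org test URL and the credentials are placeholders, not from the original notes:

    import requests

    r = requests.request(
        'GET',
        'https://httpbin.org/basic-auth/user/passwd',
        auth=('user', 'passwd'),      # HTTP basic authentication as a (user, password) tuple
        allow_redirects=True,         # follow 3xx redirects (the default)
        verify=True,                  # verify the SSL certificate (the default)
        stream=False,                 # download the body immediately (the default)
        timeout=10,
    )
    print(r.status_code)              # 200 if the credentials were accepted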

Methods

requests.request(method, url, **kwargs)

  • requests.get(url,params=None,**kwargs)
  • requests.head(url,**kwargs)
  • requests.post(url,data=None,json=None,**kwargs)
  • requests.put(url,data=None,**kwargs)
  • requests.patch(url,data=None,**kwargs)
  • requests.delete(url,**kwargs)

HTTP resources

Response object

Exceptions
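
These three headings were left empty in the original notes. As a stand-in, a minimal sketch of the response attributes and exceptions the rest of this page relies on (the attribute and exception names are the standard requests API; the URL and printed slices are illustrative):

import requests

try:
    r = requests.get('http://www.baidu.com', timeout=10)
    r.raise_for_status()           # raises requests.HTTPError for 4xx/5xx status codes
    print(r.status_code)           # HTTP status code, e.g. 200
    print(r.encoding)              # encoding declared in the response headers
    print(r.apparent_encoding)     # encoding guessed from the response body
    print(r.headers)               # response headers as a dict-like object
    print(r.text[:200])            # body as text; r.content is the raw bytes
except requests.Timeout:
    print("request timed out")
except requests.RequestException as e:
    print("request failed:", e)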

2 Request headers

If a crawl fails and the server returns a 503 error, consider changing the request headers.

Commonly used User-Agent strings:

user_agent = [
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]
headers = {'User-Agent': random.choice(user_agent)}  # requires `import random`

# Pick a random User-Agent from the list above
def get_user_agent():
    return random.choice(user_agent)

Alternatively, use the fake_useragent module to generate a random User-Agent (under the hood it fetches the strings from an online API):

# Random User-Agent header
import random
import requests
from fake_useragent import UserAgent

# Generate a random User-Agent
def get_random_ua():
    ua = UserAgent()  # create the UserAgent object
    return ua.random

kv = {"user-agent": get_random_ua()}  # present the crawler as a normal browser; worth trying if you get a 403
r = requests.get(url, timeout=50, headers=kv)  # send the request; `url` is whatever page you are fetching

3 Cookies

Write the cookie string directly into the request header:

# coding:utf-8
import requests
from bs4 import BeautifulSoup

# Cookie header value: semicolon-separated name=value pairs copied from the browser
cookie = ('cisession=19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60; '
          'CNZZDATA100020196=1815846425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483922031; '
          'Hm_lvt_f805f7762a9a237a0deac37015e9f6d9=1482722012,1483926313; '
          'Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9=1483926368')
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'Connection': 'keep-alive',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cookie': cookie
}
url = 'https://kankandou.com/book/view/22353.html'
wbdata = requests.get(url, headers=header).text
soup = BeautifulSoup(wbdata, 'lxml')
print(soup)

Or pass the cookies to requests directly:

# coding:utf-8
import requests
from bs4 import BeautifulSoup

cookie = {
    "cisession": "19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60",
    "CNZZDATA100020196": "1815846425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483922031",
    "Hm_lvt_f805f7762a9a237a0deac37015e9f6d9": "1482722012,1483926313",
    "Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9": "1483926368"
}
url = 'https://kankandou.com/book/view/22353.html'
wbdata = requests.get(url, cookies=cookie).text
soup = BeautifulSoup(wbdata, 'lxml')
print(soup)

4 Proxies

If the same IP hits a site too frequently it may get banned; consider routing requests through a pool of proxy IPs.

Overview

Anonymity levels (a quick way to check what the target actually sees follows this list):
 - Transparent: the target knows you are using a proxy and can also see your real IP
 - Anonymous: the target knows you are using a proxy but cannot see your real IP
 - Elite (high anonymity): the target does not know you are using a proxy and cannot see your real IP
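
A minimal way to see what IP the target server receives, with and without a proxy; the httpbin.org endpoint and the proxy address are placeholders, not part of the original notes:

import requests

proxy = {'http': 'http://10.10.10.1:3128'}  # placeholder proxy address

# What the server sees without a proxy
print(requests.get('http://httpbin.org/ip', timeout=10).json())

# What the server sees through the proxy; with a transparent proxy your
# real IP may still show up in headers such as X-Forwarded-For
print(requests.get('http://httpbin.org/ip', proxies=proxy, timeout=10).json())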

Types

  • http: can only be used for URLs starting with http
  • https: can only be used for URLs starting with https

Categories:

  • Forward proxy: fetches data on behalf of the client, mainly to shield the client
  • Reverse proxy: serves data on behalf of the server, mainly to shield the server or to do load balancing

Using a proxy

# Using a proxy
import random
import requests

proxy_list = [
    {"http": "112.115.57.20:3128"},
    {'http': '121.41.171.223:3128'}
]
# Pick a random proxy from the list
proxy = random.choice(proxy_list)
# A plain User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0'
}
# A test URL that echoes the visiting IP
url = 'https://www.baidu.com/s?wd=ip'
# Fetch the page through the proxy
page_text = requests.get(url=url, headers=headers, proxies=proxy).text

Free IP pool

Scrape IP:port pairs from a site that publishes free proxies:

# Test the usability of the free proxies listed on kuaidaili.com
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


class GetProxy(object):
    """Check whether the free proxies published by kuaidaili are usable."""

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/inha/1/'

    def get_random_ua(self):
        """Generate a random User-Agent string."""
        ua = UserAgent()  # create the UserAgent object
        return ua.random

    def get_proxy(self):
        """Scrape candidate proxy IPs from the kuaidaili listing page."""
        headers = {'User-Agent': self.get_random_ua()}
        html = requests.get(url=self.url, headers=headers, timeout=5).content.decode('utf-8', 'ignore')
        soup = BeautifulSoup(html, "html.parser")
        proxy = []
        for ip, port in zip(soup("td", attrs={"data-title": "IP"}), soup("td", attrs={"data-title": "PORT"})):
            proxy.append(ip.get_text() + ":" + port.get_text())
        print(proxy)
        return proxy

    def test_proxy(self, proxy):
        """Check whether a scraped proxy IP actually works."""
        L = proxy.split(':')
        proxy_ip = {
            'http': 'http://{}:{}'.format(L[0], L[1]),
            'https': 'https://{}:{}'.format(L[0], L[1])
        }
        test_url = 'https://www.baidu.com/'
        try:
            res = requests.get(url=test_url, proxies=proxy_ip, timeout=8)
            if res.status_code == 200:
                print(L[0], ":", L[1], 'Success')
                # Append working proxies to a local file for later use
                with open('proxies.txt', 'a') as f:
                    f.write(L[0] + ':' + L[1] + '\n')
        except Exception:
            print(L[0], L[1], 'Failed')

    def main(self):
        for proxy in self.get_proxy():
            self.test_proxy(proxy)


if __name__ == '__main__':
    spider = GetProxy()
    spider.main()
# Read the saved proxies back from the file and turn one into a requests-style dict
import random

def get_proxies():
    with open('proxies.txt', 'r') as f:
        result = f.readlines()             # read all lines into a list
    proxy_ip = random.choice(result)[:-1]  # pick a random line and strip the trailing newline
    L = proxy_ip.split(':')
    proxy_ip = {
        'http': 'http://{}:{}'.format(L[0], L[1]),
        'https': 'https://{}:{}'.format(L[0], L[1])
    }
    return proxy_ip

# Use it directly: requests.get(url=url, headers=headers, proxies=get_proxies())

Private (authenticated) proxies

Commonly used for VPN-style paid proxy services.

# Format of a private (authenticated) proxy
proxies = {
    'protocol': 'protocol://username:password@IP:port'
}
proxies = {
    'http': 'http://username:password@IP:port',
    'https': 'https://username:password@IP:port'
}
proxies = {
    'http': 'http://309435365:szayclhp@106.75.71.140:16816',
    'https': 'https://309435365:szayclhp@106.75.71.140:16816',
}
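
Hard-coding the username and password as above is easy to leak. A small sketch of reading them from environment variables instead; the variable names PROXY_USER / PROXY_PASS and the host are placeholders of my own, not from the original notes:

import os
import requests

user = os.environ['PROXY_USER']      # e.g. export PROXY_USER=...
password = os.environ['PROXY_PASS']  # e.g. export PROXY_PASS=...
host = '106.75.71.140:16816'         # proxy host:port from your provider

proxies = {
    'http': f'http://{user}:{password}@{host}',
    'https': f'https://{user}:{password}@{host}',
}
r = requests.get('https://www.baidu.com/', proxies=proxies, timeout=10)
print(r.status_code)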

5 The BeautifulSoup library

Official BeautifulSoup documentation (may need a proxy to reach; link in the references below).

You can also read the quick-start guide on Jianshu.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
"""
Available parsers:
bs4's HTML parser:   BeautifulSoup(mk, "html.parser")  [install bs4]
lxml's HTML parser:  BeautifulSoup(mk, "lxml")         [pip install lxml]
lxml's XML parser:   BeautifulSoup(mk, "xml")          [pip install lxml]
html5lib's parser:   BeautifulSoup(mk, "html5lib")     [pip install html5lib]
"""
print(soup.prettify())            # pretty-printed HTML
print(soup.title)                 # the whole <title> tag, angle brackets included
print(soup.title.string)          # just the text inside the <title> tag
print(soup.head.contents)         # [<title>this is a python demo page</title>]
print(soup.a)                     # the first <a> tag
print(soup.a.name)                # the tag name of the first <a> tag ("a")
print(soup.a.parent.name)         # tag name of the <a> tag's parent
print(soup.a.parent.parent.name)  # parent lookups can be chained
print(soup.a.attrs, type(soup.a.attrs))
# all attributes of the <a> tag packed into a dict (does not include the tag's own text)
print(soup.a.attrs['class'])      # look up a single attribute by key


"""
...see the official documentation for the rest
"""

A few parsing helpers I use often:
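
The original notes stop here without listing them. As a stand-in, a minimal sketch of the kind of helpers meant; the function names and the default parser are my own choices, not from the original:

from bs4 import BeautifulSoup


def get_soup(html, parser="html.parser"):
    """Wrap raw HTML in a BeautifulSoup object."""
    return BeautifulSoup(html, parser)


def first_text(soup, selector):
    """Return the stripped text of the first node matching a CSS selector, or None."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


def all_attrs(soup, selector, attr):
    """Collect an attribute (e.g. href) from every node matching a CSS selector."""
    return [node.get(attr) for node in soup.select(selector) if node.get(attr)]


# Example with the demo page used above:
# soup = get_soup(requests.get("http://python123.io/ws/demo.html").text)
# print(first_text(soup, "title"))
# print(all_attrs(soup, "a", "href"))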

6 Reading and writing YAML files

# Writing the file
import yaml

movies = []
movie = {
    # the movie_* variables below come from the crawl in section 8
    "top": movie_num,
    "titles": movie_titles_list,
    "star": movie_star,
    "link": movie_link,
    "cover": movie_cover,
    "quote": movie_quote,
    "director": movie_director,
    "actor": movie_actor,
    "year": movie_year,
    "country": movie_country,
    "theme": movie_theme,
    "story": movie_story,
}
movies.append(movie)
with open("movie.yaml", "wb") as f:
    # "w" would only accept strings and some values are numbers;
    # yaml.dump(..., encoding='utf-8') returns bytes anyway, so open in "wb"
    f.write(yaml.dump(movies, encoding='utf-8', allow_unicode=True, indent=4))
    # when the data contains Chinese characters, pass encoding='utf-8' and allow_unicode=True

# Reading the file
import yaml
import os
from billboard import get_HTML

# Read movie.yaml and pair up cover URLs with titles
with open("movie.yaml", "rb") as f:
    movie_list = yaml.load(f.read(), Loader=yaml.FullLoader)
urls = []    # cover image URLs taken from the yaml file
titles = []
top = 0
# Collect cover links and titles side by side
for movie in movie_list:
    urls.append(movie["cover"])
    titles.append(movie["titles"][0])
# [(title, cover URL), (title, cover URL), ...]
for title, url in zip(titles, urls):
    top += 1
    path = "./pic/" + title + ".jpg"
    # Skip covers that are already on disk
    if os.path.isfile(path):
        print("top", top, ":", title, "<", url, ">:已经下载!")
    else:
        with open(path, "wb") as f:
            f.write(get_HTML(url, "content"))
        print("top", top, ":", title, "<", url, ">:已经下载!")

7 Request delays

# Mimic a human browsing pace; if the site still bans you, this trick alone will not help
# Insert inside the crawl loop
import time
import random
time.sleep(random.randint(4, 5))  # sleep a random 4 or 5 seconds
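
For fractional delays, random.uniform works the same way. A small sketch of wrapping the delay around each request; the function name polite_get is my own, not from the original notes:

import time
import random
import requests


def polite_get(url, min_delay=4.0, max_delay=6.0, **kwargs):
    """Sleep a random amount of time before each request to mimic human pacing."""
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, **kwargs)


# Usage inside a crawl loop:
# r = polite_get(page_url, headers=headers, timeout=10)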

8 Full example

Scrape the Top 250 movies from Douban.

billboard.py

import requests
from bs4 import BeautifulSoup
import re
import os
import yaml
import random
import time
from fake_useragent import UserAgent


def get_random_ua():
    """Generate a random User-Agent string."""
    ua = UserAgent()  # create the UserAgent object
    useragent = ua.random
    print("useragent", useragent)
    return useragent


def get_HTML(url, type):
    """Fetch a whole page, as text or as raw bytes."""
    try:
        proxies = {  # proxies that turned out not to work; defined here but not passed to requests.get
            'http': '39.81.60.251:9000',
            'https': '123.55.106.158:9999',
        }
        kv = {"user-agent": get_random_ua()}  # present the crawler as a normal browser; worth trying if you get a 403
        r = requests.get(url, timeout=50, headers=kv)  # send the request
        print(r.request.headers)
        r.raise_for_status()  # raise HTTPError if the status code is not 200
        r.encoding = 'utf-8'  # r.encoding comes from the headers, r.apparent_encoding is guessed from the body
        if type == "text":
            return r.text  # response body as a string; "content" returns the raw bytes
        elif type == "content":
            return r.content
    except Exception:
        exit("大概是豆瓣把你禁了")


if __name__ == "__main__":
    if not os.path.isdir("pic"):  # create the cover folder if it does not exist
        os.makedirs("pic")
    i = 0
    movies = []
    while i <= 25:  # first two pages (50 movies) here; Douban lists 250, so use `while i < 250:` for the full chart
        url = f"https://movie.douban.com/top250?start={i}&filter="
        # print(get_HTML(url, "text")[:16])
        soup = BeautifulSoup(get_HTML(url, "text"), "html.parser")
        for li in soup.select("ol.grid_view > li"):
            print("------------------------------------------")
            # print(li("em")[0].string)  # rank
            movie_num = int(li("em")[0].string)
            print("top:", movie_num)  # rank
            # print(li("span", class_="title"))  # titles
            movie_titles = re.sub(r'(<span class="title">|</span>|\xa0|\/|\[|\])', "",
                                  str(li("span", class_="title")))
            movie_titles_list = re.split(r', ', movie_titles)  # titles as a list
            print("titles:", movie_titles, movie_titles_list)  # titles
            # print(li(attrs={"property": "v:average"}))  # rating
            movie_star = str(li(attrs={"property": "v:average"}))
            movie_star = float(re.findall(r'\d\.\d', movie_star)[0])  # rating
            print("star:", movie_star)
            # print(str(li("a")[0]))  # link
            movie_link = str(re.findall(r'href=".+"', str(li("a")[0])))
            movie_link = re.sub(r'(href="|"|\'|\[|\])', "", movie_link).strip()
            print("link:", movie_link)  # link to the detail page
            movie_cover = str(re.findall(r'src=".+"', str(li("a")[0])))
            movie_cover = re.sub(r'(src="|"|\'|\[|\]|width.+|\xa0)', "", movie_cover).strip()
            print("cover:", movie_cover)  # cover image URL
            # The pages are hit frequently; sleep to avoid an IP ban (can be dropped when using a proxy pool)
            time.sleep(random.randint(4, 6))
            soup_detail = BeautifulSoup(get_HTML(movie_link, "text"), "html.parser")  # fetch the detail page for the synopsis
            movie_story = str(soup_detail.find_all(attrs={"property": "v:summary"}))
            movie_story = re.sub(r'(<span| class=""| property="v:summary">)|<br/>|</span>|\[|\]| *|\n', "",
                                 movie_story)
            movie_story = re.sub(r'  ', '\n\t', movie_story)  # restore rough paragraph breaks
            print("story:", movie_story)  # synopsis
            try:
                # print(li("p"))  # intro block
                movie_quote = str(li("p")[1])  # one-line quote
                movie_quote = re.sub(r'(<p class="quote">|\n|<span class="inq">|</span>|</p>)', "", movie_quote)
                print("quote:", movie_quote)  # one-line quote
                movie_intro = str(li("p")[0])  # intro block
                movie_intro = re.sub(r'(<p class="">|\n|<br/>|</p>)', "", movie_intro)
                movie_intro = re.split(r' {28}', movie_intro)
                movie_intro.remove("")
                # print(movie_intro)  # intro block
                movie_director, movie_actor = re.split(r'\xa0+', movie_intro[0])[0], re.split(r'\xa0+', movie_intro[0])[1]
                movie_year, movie_country, movie_theme = movie_intro[1].split("\xa0/\xa0")
                movie_year = int(movie_year)
                movie_country_list = movie_country.split(" ")
                movie_theme = re.sub(r' +', '', movie_theme)
                movie_theme_list = movie_theme.split(" ")
                print("director:", movie_director)  # director
                print("actor:", movie_actor)       # lead actors
                print("year:", movie_year)         # year
                print("country:", movie_country, movie_country_list)  # country
                print("theme:", movie_theme, movie_theme_list)        # genres
            except Exception:
                print("此片子爬不到!!")
                movie_quote = "未爬取"
                movie_director = "未爬取"
                movie_actor = "未爬取"
                movie_year = "未爬取"
                movie_country_list = "未爬取"
                movie_theme_list = "未爬取"
            movie = {
                "top": movie_num,
                "titles": movie_titles_list,
                "star": movie_star,
                "link": movie_link,
                "cover": movie_cover,
                "quote": movie_quote,
                "director": movie_director,
                "actor": movie_actor,
                "year": movie_year,
                "country": movie_country_list,
                "theme": movie_theme_list,
                "story": movie_story,
            }
            movies.append(movie)
        i += 25
    with open("movie.yaml", "wb") as f:
        # when the data contains Chinese characters, pass encoding='utf-8' and allow_unicode=True
        f.write(yaml.dump(movies, encoding='utf-8', allow_unicode=True, indent=4))

Output of billboard.py:

movie.yaml

Read the data back from movie.yaml and run billboard_cover.py:

import yaml
import os
from billboard import get_HTML

# Read movie.yaml and pair cover URLs with titles


with open("movie.yaml", "rb") as f:
    movie_list = yaml.load(f.read(), Loader=yaml.FullLoader)
urls = []    # cover image URLs taken from the yaml file
titles = []
top = 0
for movie in movie_list:
    urls.append(movie["cover"])
    titles.append(movie["titles"][0])
for title, url in zip(titles, urls):
    top += 1
    path = "./pic/" + title + ".jpg"
    if os.path.isfile(path):
        print("top", top, ":", title, "<", url, ">:已经下载!")
    else:
        with open(path, "wb") as f:
            f.write(get_HTML(url, "content"))
        print("top", top, ":", title, "<", url, ">:已经下载!")

billboard_cover.py

Output:

References

Common browser User-Agent strings: https://pengshiyu.blog.csdn.net/article/details/80182397

Proxies: https://www.jianshu.com/p/5958b0cb3de2

Proxy services:

Using cookies in a crawler: https://blog.csdn.net/weixin_38706928/article/details/80376572

Reading and writing YAML files: https://blog.csdn.net/lmj19851117/article/details/78843486

Official BeautifulSoup documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/