Scraping the Douban Top 250

Starting from a web-scraping assignment

Watch the course videos to get familiar with the basics: https://www.icourse163.org/learn/BIT-1001870001#/learn/announce (about two to three hours in total)

1 The requests library

Parameters

  • params: dict or byte sequence, appended to the URL as query parameters

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('GET', 'http://python123.io/ws', params=kv)
    print(r.url)
    # http://python123.io/ws?key1=value1&key2=value2
  • data: dict, byte sequence, or file object, sent as the request body

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('POST', 'http://python123.io/ws', data=kv)
    body = "body content"
    r = requests.request('POST', 'http://python123.io/ws', data=body)
  • json: data in JSON format, sent as the request body

    kv = {'key1': 'value1'}
    r = requests.request('POST', 'http://python123.io/ws', json=kv)
  • headers: dict of custom HTTP headers

    hd = {'user-agent': 'Chrome/10'}
    r = requests.request('POST', 'http://python123.io/ws', headers=hd)
  • cookies: dict or CookieJar, cookies attached to the request

  • auth: tuple, credentials for HTTP basic authentication (see the combined example after this list)

  • files: dict, for uploading files

    fs = {'file': open('data.xls', 'rb')}
    r = requests.request('POST', 'http://python123.io/ws', files=fs)
  • timeout: request timeout in seconds

    r = requests.request('GET', 'http://www.baidu.com', timeout=10)
  • proxies: dict mapping protocols to proxy servers; login credentials can be embedded in the URL

    pxs = {
        'http': 'http://user:pass@10.10.10.1:1234',
        'https': 'https://10.10.10.1:4321'
    }
    r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
  • allow_redirects: True/False, default True; whether to follow redirects

  • stream: True/False, default False; when False the response body is downloaded immediately, when True it is streamed on demand

  • verify: True/False, default True; whether to verify the server's SSL certificate

  • cert: path to a local SSL client certificate
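
    The flags above that have no snippet of their own can be combined in one call. A minimal sketch; the httpbin.org test URL and the credentials are placeholders, not from the original notes:

    import requests

    r = requests.request(
        'GET',
        'https://httpbin.org/basic-auth/user/passwd',
        auth=('user', 'passwd'),      # HTTP basic authentication as a (user, password) tuple
        allow_redirects=True,         # follow 3xx redirects (the default)
        verify=True,                  # verify the SSL certificate (the default)
        stream=False,                 # download the body immediately (the default)
        timeout=10,
    )
    print(r.status_code)              # 200 if the credentials were accepted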

Methods

requests.request(method, url, **kwargs)

  • requests.get(url,params=None,**kwargs)
  • requests.head(url,**kwargs)
  • requests.post(url,data=None,json=None,**kwargs)
  • requests.put(url,data=None,**kwargs)
  • requests.patch(url,data=None,**kwargs)
  • requests.delete(url,**kwargs)

HTTP resources

Response object

Exceptions
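
These three headings were left empty in the original notes. As a stand-in, a minimal sketch of the response attributes and exceptions the rest of this page relies on (the attribute and exception names are the standard requests API; the URL and printed slices are illustrative):

import requests

try:
    r = requests.get('http://www.baidu.com', timeout=10)
    r.raise_for_status()           # raises requests.HTTPError for 4xx/5xx status codes
    print(r.status_code)           # HTTP status code, e.g. 200
    print(r.encoding)              # encoding declared in the response headers
    print(r.apparent_encoding)     # encoding guessed from the response body
    print(r.headers)               # response headers as a dict-like object
    print(r.text[:200])            # body as text; r.content is the raw bytes
except requests.Timeout:
    print("request timed out")
except requests.RequestException as e:
    print("request failed:", e)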

2 Request headers

If a crawl fails and the server returns a 503 error, consider changing the request headers.

Commonly used User-Agent strings:

user_agent = [
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]
headers = {'User-Agent': random.choice(user_agent)}  # requires `import random`

# Pick a random User-Agent from the list above
def get_user_agent():
    return random.choice(user_agent)

Alternatively, use the fake_useragent module to generate a random User-Agent (under the hood it fetches the strings from an online API):

# Random User-Agent header
import random
import requests
from fake_useragent import UserAgent

# Generate a random User-Agent
def get_random_ua():
    ua = UserAgent()  # create the UserAgent object
    return ua.random

kv = {"user-agent": get_random_ua()}  # present the crawler as a normal browser; worth trying if you get a 403
r = requests.get(url, timeout=50, headers=kv)  # send the request; `url` is whatever page you are fetching

3 Cookies

Write the cookie string directly into the request header:

# coding:utf-8
import requests
from bs4 import BeautifulSoup

# Cookie header value: semicolon-separated name=value pairs copied from the browser
cookie = ('cisession=19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60; '
          'CNZZDATA100020196=1815846425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483922031; '
          'Hm_lvt_f805f7762a9a237a0deac37015e9f6d9=1482722012,1483926313; '
          'Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9=1483926368')
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'Connection': 'keep-alive',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cookie': cookie
}
url = 'https://kankandou.com/book/view/22353.html'
wbdata = requests.get(url, headers=header).text
soup = BeautifulSoup(wbdata, 'lxml')
print(soup)

Or pass the cookies to requests directly:

# coding:utf-8
import requests
from bs4 import BeautifulSoup

cookie = {
    "cisession": "19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60",
    "CNZZDATA100020196": "1815846425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483922031",
    "Hm_lvt_f805f7762a9a237a0deac37015e9f6d9": "1482722012,1483926313",
    "Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9": "1483926368"
}
url = 'https://kankandou.com/book/view/22353.html'
wbdata = requests.get(url, cookies=cookie).text
soup = BeautifulSoup(wbdata, 'lxml')
print(soup)

4 Proxies

If the same IP hits a site too frequently it may get banned; consider routing requests through a pool of proxy IPs.

Overview

Anonymity levels (a quick way to check what the target actually sees follows this list):
 - Transparent: the target knows you are using a proxy and can also see your real IP
 - Anonymous: the target knows you are using a proxy but cannot see your real IP
 - Elite (high anonymity): the target does not know you are using a proxy and cannot see your real IP
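
A minimal way to see what IP the target server receives, with and without a proxy; the httpbin.org endpoint and the proxy address are placeholders, not part of the original notes:

import requests

proxy = {'http': 'http://10.10.10.1:3128'}  # placeholder proxy address

# What the server sees without a proxy
print(requests.get('http://httpbin.org/ip', timeout=10).json())

# What the server sees through the proxy; with a transparent proxy your
# real IP may still show up in headers such as X-Forwarded-For
print(requests.get('http://httpbin.org/ip', proxies=proxy, timeout=10).json())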

Types

  • http: can only be used for URLs starting with http
  • https: can only be used for URLs starting with https

Categories:

  • Forward proxy: fetches data on behalf of the client, mainly to shield the client
  • Reverse proxy: serves data on behalf of the server, mainly to shield the server or to do load balancing

Using a proxy

# Using a proxy
import random
import requests

proxy_list = [
    {"http": "112.115.57.20:3128"},
    {'http': '121.41.171.223:3128'}
]
# Pick a random proxy from the list
proxy = random.choice(proxy_list)
# A plain User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0'
}
# A test URL that echoes the visiting IP
url = 'https://www.baidu.com/s?wd=ip'
# Fetch the page through the proxy
page_text = requests.get(url=url, headers=headers, proxies=proxy).text

Free IP pool

Scrape IP:port pairs from a site that publishes free proxies:

# Test the usability of the free proxies listed on kuaidaili.com
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


class GetProxy(object):
    """Check whether the free proxies published by kuaidaili are usable."""

    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/inha/1/'

    def get_random_ua(self):
        """Generate a random User-Agent string."""
        ua = UserAgent()  # create the UserAgent object
        return ua.random

    def get_proxy(self):
        """Scrape candidate proxy IPs from the kuaidaili listing page."""
        headers = {'User-Agent': self.get_random_ua()}
        html = requests.get(url=self.url, headers=headers, timeout=5).content.decode('utf-8', 'ignore')
        soup = BeautifulSoup(html, "html.parser")
        proxy = []
        for ip, port in zip(soup("td", attrs={"data-title": "IP"}), soup("td", attrs={"data-title": "PORT"})):
            proxy.append(ip.get_text() + ":" + port.get_text())
        print(proxy)
        return proxy

    def test_proxy(self, proxy):
        """Check whether a scraped proxy IP actually works."""
        L = proxy.split(':')
        proxy_ip = {
            'http': 'http://{}:{}'.format(L[0], L[1]),
            'https': 'https://{}:{}'.format(L[0], L[1])
        }
        test_url = 'https://www.baidu.com/'
        try:
            res = requests.get(url=test_url, proxies=proxy_ip, timeout=8)
            if res.status_code == 200:
                print(L[0], ":", L[1], 'Success')
                # Append working proxies to a local file for later use
                with open('proxies.txt', 'a') as f:
                    f.write(L[0] + ':' + L[1] + '\n')
        except Exception:
            print(L[0], L[1], 'Failed')

    def main(self):
        for proxy in self.get_proxy():
            self.test_proxy(proxy)


if __name__ == '__main__':
    spider = GetProxy()
    spider.main()
# Read the saved proxies back from the file and turn one into a requests-style dict
import random

def get_proxies():
    with open('proxies.txt', 'r') as f:
        result = f.readlines()             # read all lines into a list
    proxy_ip = random.choice(result)[:-1]  # pick a random line and strip the trailing newline
    L = proxy_ip.split(':')
    proxy_ip = {
        'http': 'http://{}:{}'.format(L[0], L[1]),
        'https': 'https://{}:{}'.format(L[0], L[1])
    }
    return proxy_ip

# Use it directly: requests.get(url=url, headers=headers, proxies=get_proxies())

Private (authenticated) proxies

Commonly used for VPN-style paid proxy services.

# Format of a private (authenticated) proxy
proxies = {
    'protocol': 'protocol://username:password@IP:port'
}
proxies = {
    'http': 'http://username:password@IP:port',
    'https': 'https://username:password@IP:port'
}
proxies = {
    'http': 'http://309435365:szayclhp@106.75.71.140:16816',
    'https': 'https://309435365:szayclhp@106.75.71.140:16816',
}
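
Hard-coding the username and password as above is easy to leak. A small sketch of reading them from environment variables instead; the variable names PROXY_USER / PROXY_PASS and the host are placeholders of my own, not from the original notes:

import os
import requests

user = os.environ['PROXY_USER']      # e.g. export PROXY_USER=...
password = os.environ['PROXY_PASS']  # e.g. export PROXY_PASS=...
host = '106.75.71.140:16816'         # proxy host:port from your provider

proxies = {
    'http': f'http://{user}:{password}@{host}',
    'https': f'https://{user}:{password}@{host}',
}
r = requests.get('https://www.baidu.com/', proxies=proxies, timeout=10)
print(r.status_code)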

5 The BeautifulSoup library

Official BeautifulSoup documentation (may need a proxy to reach; link in the references below).

You can also read the quick-start guide on Jianshu.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
"""
Available parsers:
bs4's HTML parser:   BeautifulSoup(mk, "html.parser")  [install bs4]
lxml's HTML parser:  BeautifulSoup(mk, "lxml")         [pip install lxml]
lxml's XML parser:   BeautifulSoup(mk, "xml")          [pip install lxml]
html5lib's parser:   BeautifulSoup(mk, "html5lib")     [pip install html5lib]
"""
print(soup.prettify())            # pretty-printed HTML
print(soup.title)                 # the whole <title> tag, angle brackets included
print(soup.title.string)          # just the text inside the <title> tag
print(soup.head.contents)         # [<title>this is a python demo page</title>]
print(soup.a)                     # the first <a> tag
print(soup.a.name)                # the tag name of the first <a> tag ("a")
print(soup.a.parent.name)         # tag name of the <a> tag's parent
print(soup.a.parent.parent.name)  # parent lookups can be chained
print(soup.a.attrs, type(soup.a.attrs))
# all attributes of the <a> tag packed into a dict (does not include the tag's own text)
print(soup.a.attrs['class'])      # look up a single attribute by key


"""
...see the official documentation for the rest
"""

A few parsing helpers I use often:
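
The original notes stop here without listing them. As a stand-in, a minimal sketch of the kind of helpers meant; the function names and the default parser are my own choices, not from the original:

from bs4 import BeautifulSoup


def get_soup(html, parser="html.parser"):
    """Wrap raw HTML in a BeautifulSoup object."""
    return BeautifulSoup(html, parser)


def first_text(soup, selector):
    """Return the stripped text of the first node matching a CSS selector, or None."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


def all_attrs(soup, selector, attr):
    """Collect an attribute (e.g. href) from every node matching a CSS selector."""
    return [node.get(attr) for node in soup.select(selector) if node.get(attr)]


# Example with the demo page used above:
# soup = get_soup(requests.get("http://python123.io/ws/demo.html").text)
# print(first_text(soup, "title"))
# print(all_attrs(soup, "a", "href"))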

6 Reading and writing YAML files

# Writing the file
import yaml

movies = []
movie = {
    # the movie_* variables below come from the crawl in section 8
    "top": movie_num,
    "titles": movie_titles_list,
    "star": movie_star,
    "link": movie_link,
    "cover": movie_cover,
    "quote": movie_quote,
    "director": movie_director,
    "actor": movie_actor,
    "year": movie_year,
    "country": movie_country,
    "theme": movie_theme,
    "story": movie_story,
}
movies.append(movie)
with open("movie.yaml", "wb") as f:
    # "w" would only accept strings and some values are numbers;
    # yaml.dump(..., encoding='utf-8') returns bytes anyway, so open in "wb"
    f.write(yaml.dump(movies, encoding='utf-8', allow_unicode=True, indent=4))
    # when the data contains Chinese characters, pass encoding='utf-8' and allow_unicode=True

# Reading the file
import yaml
import os
from billboard import get_HTML

# Read movie.yaml and pair up cover URLs with titles
with open("movie.yaml", "rb") as f:
    movie_list = yaml.load(f.read(), Loader=yaml.FullLoader)
urls = []    # cover image URLs taken from the yaml file
titles = []
top = 0
# Collect cover links and titles side by side
for movie in movie_list:
    urls.append(movie["cover"])
    titles.append(movie["titles"][0])
# [(title, cover URL), (title, cover URL), ...]
for title, url in zip(titles, urls):
    top += 1
    path = "./pic/" + title + ".jpg"
    # Skip covers that are already on disk
    if os.path.isfile(path):
        print("top", top, ":", title, "<", url, ">:已经下载!")
    else:
        with open(path, "wb") as f:
            f.write(get_HTML(url, "content"))
        print("top", top, ":", title, "<", url, ">:已经下载!")

7 Request delays

# Mimic a human browsing pace; if the site still bans you, this trick alone will not help
# Insert inside the crawl loop
import time
import random
time.sleep(random.randint(4, 5))  # sleep a random 4 or 5 seconds
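
For fractional delays, random.uniform works the same way. A small sketch of wrapping the delay around each request; the function name polite_get is my own, not from the original notes:

import time
import random
import requests


def polite_get(url, min_delay=4.0, max_delay=6.0, **kwargs):
    """Sleep a random amount of time before each request to mimic human pacing."""
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, **kwargs)


# Usage inside a crawl loop:
# r = polite_get(page_url, headers=headers, timeout=10)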

8 Full example

Scrape the Top 250 movies from Douban.

billboard.py

import requests
from bs4 import BeautifulSoup
import re
import os
import yaml
import random
import time
from fake_useragent import UserAgent


def get_random_ua():
    """Generate a random User-Agent string."""
    ua = UserAgent()  # create the UserAgent object
    useragent = ua.random
    print("useragent", useragent)
    return useragent


def get_HTML(url, type):
    """Fetch a whole page, as text or as raw bytes."""
    try:
        proxies = {  # proxies that turned out not to work; defined here but not passed to requests.get
            'http': '39.81.60.251:9000',
            'https': '123.55.106.158:9999',
        }
        kv = {"user-agent": get_random_ua()}  # present the crawler as a normal browser; worth trying if you get a 403
        r = requests.get(url, timeout=50, headers=kv)  # send the request
        print(r.request.headers)
        r.raise_for_status()  # raise HTTPError if the status code is not 200
        r.encoding = 'utf-8'  # r.encoding comes from the headers, r.apparent_encoding is guessed from the body
        if type == "text":
            return r.text  # response body as a string; "content" returns the raw bytes
        elif type == "content":
            return r.content
    except Exception:
        exit("大概是豆瓣把你禁了")


if __name__ == "__main__":
    if not os.path.isdir("pic"):  # create the cover folder if it does not exist
        os.makedirs("pic")
    i = 0
    movies = []
    while i <= 25:  # first two pages (50 movies) here; Douban lists 250, so use `while i < 250:` for the full chart
        url = f"https://movie.douban.com/top250?start={i}&filter="
        # print(get_HTML(url, "text")[:16])
        soup = BeautifulSoup(get_HTML(url, "text"), "html.parser")
        for li in soup.select("ol.grid_view > li"):
            print("------------------------------------------")
            # print(li("em")[0].string)  # rank
            movie_num = int(li("em")[0].string)
            print("top:", movie_num)  # rank
            # print(li("span", class_="title"))  # titles
            movie_titles = re.sub(r'(<span class="title">|</span>|\xa0|\/|\[|\])', "",
                                  str(li("span", class_="title")))
            movie_titles_list = re.split(r', ', movie_titles)  # titles as a list
            print("titles:", movie_titles, movie_titles_list)  # titles
            # print(li(attrs={"property": "v:average"}))  # rating
            movie_star = str(li(attrs={"property": "v:average"}))
            movie_star = float(re.findall(r'\d\.\d', movie_star)[0])  # rating
            print("star:", movie_star)
            # print(str(li("a")[0]))  # link
            movie_link = str(re.findall(r'href=".+"', str(li("a")[0])))
            movie_link = re.sub(r'(href="|"|\'|\[|\])', "", movie_link).strip()
            print("link:", movie_link)  # link to the detail page
            movie_cover = str(re.findall(r'src=".+"', str(li("a")[0])))
            movie_cover = re.sub(r'(src="|"|\'|\[|\]|width.+|\xa0)', "", movie_cover).strip()
            print("cover:", movie_cover)  # cover image URL
            # The pages are hit frequently; sleep to avoid an IP ban (can be dropped when using a proxy pool)
            time.sleep(random.randint(4, 6))
            soup_detail = BeautifulSoup(get_HTML(movie_link, "text"), "html.parser")  # fetch the detail page for the synopsis
            movie_story = str(soup_detail.find_all(attrs={"property": "v:summary"}))
            movie_story = re.sub(r'(<span| class=""| property="v:summary">)|<br/>|</span>|\[|\]| *|\n', "",
                                 movie_story)
            movie_story = re.sub(r'  ', '\n\t', movie_story)  # restore rough paragraph breaks
            print("story:", movie_story)  # synopsis
            try:
                # print(li("p"))  # intro block
                movie_quote = str(li("p")[1])  # one-line quote
                movie_quote = re.sub(r'(<p class="quote">|\n|<span class="inq">|</span>|</p>)', "", movie_quote)
                print("quote:", movie_quote)  # one-line quote
                movie_intro = str(li("p")[0])  # intro block
                movie_intro = re.sub(r'(<p class="">|\n|<br/>|</p>)', "", movie_intro)
                movie_intro = re.split(r' {28}', movie_intro)
                movie_intro.remove("")
                # print(movie_intro)  # intro block
                movie_director, movie_actor = re.split(r'\xa0+', movie_intro[0])[0], re.split(r'\xa0+', movie_intro[0])[1]
                movie_year, movie_country, movie_theme = movie_intro[1].split("\xa0/\xa0")
                movie_year = int(movie_year)
                movie_country_list = movie_country.split(" ")
                movie_theme = re.sub(r' +', '', movie_theme)
                movie_theme_list = movie_theme.split(" ")
                print("director:", movie_director)  # director
                print("actor:", movie_actor)       # lead actors
                print("year:", movie_year)         # year
                print("country:", movie_country, movie_country_list)  # country
                print("theme:", movie_theme, movie_theme_list)        # genres
            except Exception:
                print("此片子爬不到!!")
                movie_quote = "未爬取"
                movie_director = "未爬取"
                movie_actor = "未爬取"
                movie_year = "未爬取"
                movie_country_list = "未爬取"
                movie_theme_list = "未爬取"
            movie = {
                "top": movie_num,
                "titles": movie_titles_list,
                "star": movie_star,
                "link": movie_link,
                "cover": movie_cover,
                "quote": movie_quote,
                "director": movie_director,
                "actor": movie_actor,
                "year": movie_year,
                "country": movie_country_list,
                "theme": movie_theme_list,
                "story": movie_story,
            }
            movies.append(movie)
        i += 25
    with open("movie.yaml", "wb") as f:
        # when the data contains Chinese characters, pass encoding='utf-8' and allow_unicode=True
        f.write(yaml.dump(movies, encoding='utf-8', allow_unicode=True, indent=4))

Output of billboard.py:

movie.yaml

Read the data back from movie.yaml and run billboard_cover.py:

import yaml
import os
from billboard import get_HTML

# Read movie.yaml and pair cover URLs with titles


with open("movie.yaml", "rb") as f:
    movie_list = yaml.load(f.read(), Loader=yaml.FullLoader)
urls = []    # cover image URLs taken from the yaml file
titles = []
top = 0
for movie in movie_list:
    urls.append(movie["cover"])
    titles.append(movie["titles"][0])
for title, url in zip(titles, urls):
    top += 1
    path = "./pic/" + title + ".jpg"
    if os.path.isfile(path):
        print("top", top, ":", title, "<", url, ">:已经下载!")
    else:
        with open(path, "wb") as f:
            f.write(get_HTML(url, "content"))
        print("top", top, ":", title, "<", url, ">:已经下载!")

billboard_cover.py

Output:

References

Common browser User-Agent strings: https://pengshiyu.blog.csdn.net/article/details/80182397

Proxies: https://www.jianshu.com/p/5958b0cb3de2

Proxy services:

Using cookies in a crawler: https://blog.csdn.net/weixin_38706928/article/details/80376572

Reading and writing YAML files: https://blog.csdn.net/lmj19851117/article/details/78843486

Official BeautifulSoup documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/