YAML, JSON, and XML

For a recent crawler assignment (scraping the Zhihu top 50), I saved the data with yaml. The data is a list of dictionaries, like this:

[
    {
        "top": 34,
        "url": "https://www.zhihu.com/question/265595875",
        "question": "考研如何确定学校?",
        "recommend": 216,
        "comment": 1,
        "star": 4695,
        "watcher": 1573098,
        "date": "2020-05-22 07:25:17"
    },
    {
        "top": 35,
        "url": "https://www.zhihu.com/question/432129590",
        "question": "如果女生主动一点,男生会不会心动?",
        "recommend": null,
        "comment": "添加评论",
        "star": 154,
        "watcher": 175721,
        "date": "2020-12-08 15:05:46"
    }
]

Saving it with yaml.dump raised an error:

with open('hot_list.yaml', 'wb') as f:
    f.write(yaml.dump(data_list, encoding='utf-8', allow_unicode=True, indent=4))
"""
Traceback (most recent call last):
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 98, in <module>
    get_section(get_url())
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 96, in get_section
    f.write(yaml.dump(data_list, encoding='utf-8', allow_unicode=True, indent=4))
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 290, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 278, in dump_all
    dumper.represent(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 27, in represent
    node = self.represent_data(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
...
(many lines of yaml-internal frames omitted)
...
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 34, in represent_data
    if self.ignore_aliases(data):
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 139, in ignore_aliases
    if isinstance(data, tuple) and data == ():
RecursionError: maximum recursion depth exceeded while calling a Python object

Process finished with exit code 1
"""

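The message says the maximum recursion depth was exceeded. If this were an ordinary case of deep recursion, we could raise the limit manually: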

import sys
sys.setrecursionlimit(1500)  # set the maximum recursion depth to 1500
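
As a rough illustration of what raising the limit does, the sketch below uses a hypothetical helper, depth, that recurses n levels. The limits 500 and 3000 are arbitrary values chosen for the demonstration, not recommendations.

```python
import sys


def depth(n):
    """Hypothetical helper: recurse n levels deep and return n."""
    return 0 if n == 0 else 1 + depth(n - 1)


original_limit = sys.getrecursionlimit()

# With a low limit, 1000 nested calls fail.
sys.setrecursionlimit(500)
try:
    depth(1000)
    hit_limit = False
except RecursionError:
    hit_limit = True

# With a higher limit, the same call succeeds.
sys.setrecursionlimit(3000)
result = depth(1000)

# Restore the interpreter's previous setting.
sys.setrecursionlimit(original_limit)
print(hit_limit, result)
```

Raising the limit only helps when the recursion is finite. A traversal that keeps following references in a cycle will exhaust any limit.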

But this is clearly not an ordinary Python recursion problem. After switching from yaml.dump to yaml.safe_dump(), the error becomes:

with open("hot_list.yaml", "w") as f:
    f.write(yaml.safe_dump(data_list, allow_unicode=True, indent=4))
"""
Traceback (most recent call last):
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 95, in <module>
    get_section(get_url())
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 93, in get_section
    f.write(yaml.safe_dump(data_list, allow_unicode=True, indent=4))
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 306, in safe_dump
    return dump_all([data], stream, Dumper=SafeDumper, **kwds)
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 278, in dump_all
    dumper.represent(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 27, in represent
    node = self.represent_data(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 199, in represent_list
    return self.represent_sequence('tag:yaml.org,2002:seq', data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 207, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 58, in represent_data
    node = self.yaml_representers[None](self, data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 231, in represent_undefined
    raise RepresenterError("cannot represent an object", data)
yaml.representer.RepresenterError: ('cannot represent an object', '2020-12-08 11:03:36')
2020-12-08 07:47:44
"""

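This time the error points at the date value; perhaps yaml treated the date as a date object. The two errors look unrelated, but both suggest that something is wrong with the format of the data.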

Next I saved the data as JSON instead, and that worked. The result is the JSON file shown at the top. The code:

with open('hot_list.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(data_list, indent=4, ensure_ascii=False, separators=(', ', ': ')))

Then I read the data back from the JSON file into hot_list and wrote hot_list to a yaml file. This succeeded.

with open('hot_list.json', 'r', encoding='utf-8') as f:
    hot_list = json.loads(f.read())
    print(hot_list)
with open('hot_list.yaml', 'wb') as f:
    f.write(yaml.dump(hot_list, encoding='utf-8', allow_unicode=True, indent=4))

Contents of hot_list.yaml (excerpt):

-   comment: 添加评论
    date: '2020-12-08 15:05:46'
    question: 如果女生主动一点,男生会不会心动?
    recommend: null
    star: 154
    top: 35
    url: https://www.zhihu.com/question/432129590
    watcher: 175721
-   comment: 1
    date: '2020-12-08 13:35:09'
    question: 哈登昨日没有去球队报道,他和火箭队的后续会怎样?
    recommend: 8
    star: 40
    top: 36
    url: https://www.zhihu.com/question/433854967
    watcher: 29397

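To summarize: the bug comes from data that does not meet yaml's requirements. But which part of the data? After going through it carefully, the only unusual value I could find was None.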

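Since there is no guarantee that the crawler finds every field, and to make later cleaning with pandas easier, I initialize each dictionary like this: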

data = {
    'top': None,
    'url': None,
    'question': None,
    'recommend': None,
    'comment': None,
    'star': None,
    'watcher': None,
    'date': None,
}

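This looked like the problem: when the crawler fails to extract a field, its value stays None. On closer inspection, though, None cannot be what yaml rejects, because PyYAML writes None as null (the recommend: null line in the YAML output above came from a None). The RepresenterError is more telling: the rejected value prints like an ordinary string, which is what happens when a value is an instance of a str subclass rather than a plain str. safe_dump refuses such objects, and dump tries to serialize them as arbitrary Python objects, which is presumably where the runaway recursion came from. Text returned by an HTML parser is a likely source of such values, although I have not verified the exact type here.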

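JSON does accept None and writes it to the file as null. That null is a JSON literal, not a string, and json.loads turns it back into None. What the round trip through JSON does guarantee is that every value comes back as a built-in type (dict, list, str, int, float, bool, or None), and yaml can dump all of those.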
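
Whether None is really what yaml rejects can be checked with a small experiment. The sketch below assumes PyYAML is installed and uses a hypothetical ScrapedText class as a stand-in for whatever str subclass an HTML parser might return. The crawler's actual value type is not shown in this post, so treat that part as an assumption.

```python
import json

import yaml


class ScrapedText(str):
    """Hypothetical stand-in for a parser's text type (a str subclass)."""


# None on its own is accepted and written as null.
none_output = yaml.safe_dump({"recommend": None})

# A str subclass is rejected by safe_dump even though it prints like a string.
record = {"date": ScrapedText("2020-12-08 15:05:46"), "recommend": None}
try:
    yaml.safe_dump(record)
    rejected = False
except yaml.YAMLError:
    rejected = True

# A JSON round trip returns only built-in types, which safe_dump accepts.
cleaned = json.loads(json.dumps(record, ensure_ascii=False))
cleaned_output = yaml.safe_dump(cleaned, allow_unicode=True)

print(none_output, rejected, type(cleaned["date"]).__name__)
print(cleaned_output)
```

If the JSON detour feels indirect, converting each scraped value with str(...) before storing it in the dictionary should have the same effect, since str() applied to a str subclass returns a plain str.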

json

import json
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]

If data is a str that uses single quotes, replace them with double quotes so that it can be parsed as JSON:

data = data.replace("'", '"')
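
Replacing quotes breaks as soon as a value contains an apostrophe, or the text contains None, True, or False. If the string is really the repr of a Python list or dict, one alternative (a different technique from the replace call above) is to parse it with ast.literal_eval and then serialize it with json.dumps. The sample text here is made up for illustration.

```python
import ast
import json

# A Python-literal string, as produced by str() or repr() on a list of dicts.
text = "[{'a': 1, 'name': \"it's\", 'flag': None}]"

parsed = ast.literal_eval(text)  # evaluates Python literals only, no arbitrary code
as_json = json.dumps(parsed)     # valid JSON text with double-quoted strings

print(as_json)
```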

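If data contains \u escapes alongside stray backslashes, double every backslash that is not part of a \/, \u, or \" escape so that the string can still be parsed: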

import re
data = re.sub(r'\\(?![/u"])', r"\\\\", str(data))
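
To see what this substitution does, the sketch below applies it to a made-up JSON string that contains one invalid escape (\d) and one valid \u escape. Note that the pattern also doubles the backslash of other valid escapes such as \n, so it is only suitable when those are not expected in the input.

```python
import json
import re

# Raw string: the backslashes below are literal characters in the text.
raw = r'{"path": "C:\data", "name": "\u4e2d"}'

try:
    json.loads(raw)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False  # "\d" is not a valid JSON escape

# Double every backslash that is not followed by /, u, or ".
fixed = re.sub(r'\\(?![/u"])', r"\\\\", raw)
parsed = json.loads(fixed)

print(parsed_raw, parsed)
```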

If data contains Chinese text, pass ensure_ascii=False and open the file with encoding='utf-8':

with open('numbers.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(data, indent=4, ensure_ascii=False, separators=(', ', ': ')))
with open('numbers.json', 'r', encoding='utf-8') as f:
    read = json.loads(f.read())
    print(read)
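
The effect of ensure_ascii can be seen without touching a file. This small sketch reuses one question title from the data at the top of the post:

```python
import json

record = {"question": "考研如何确定学校?"}

escaped = json.dumps(record)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as they are

print(escaped)
print(readable)
```

Both strings load back to the same dictionary; the difference is only in how readable the file is.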

dumps returns a string, so write it with f.write(json.dumps(data)):

import json
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]
with open('numbers.json', 'w') as f:
    # dumps returns a str, i.e. it converts the data to JSON text
    f.write(json.dumps(data, sort_keys=True, indent=4, separators=(', ', ': ')))
with open('numbers.json', 'r') as f:
    read = json.loads(f.read())
    print(read)

dump needs a file object to write to, so use json.dump(data, fp=f):

import json
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]
with open('numbers2.json', 'w') as f:
    # dump takes a file-like object and writes the JSON text to it
    json.dump(data, fp=f, sort_keys=True, indent=4, separators=(', ', ': '))
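
The relationship between the two functions can be checked in memory. This is a minimal sketch that uses io.StringIO as a stand-in for an open text file:

```python
import io
import json

data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]

buffer = io.StringIO()  # in-memory stand-in for an open text file
json.dump(data, buffer, sort_keys=True, indent=4)  # writes into the buffer
text = json.dumps(data, sort_keys=True, indent=4)  # returns the text instead

print(buffer.getvalue() == text)
```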

Reading from a file:

with open('numbers.json', 'r') as f:
    read = json.load(f)
    print(read)

yaml

yaml has no dumps and loads functions. Called without a stream, yaml.dump returns the serialized result, so use f.write(yaml.dump(data)).

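When encoding is passed, yaml.dump returns bytes rather than str, so open the file in binary mode. For Chinese text, choose encoding='utf-8':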

import yaml
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]


with open("numberb.yaml", "wb") as f:
    # Binary write: with encoding set, yaml.dump returns bytes.
    # If the data contains Chinese characters, add encoding='utf-8', allow_unicode=True.
    f.write(yaml.dump(data, encoding='utf-8', allow_unicode=True, indent=4))


with open("numberb.yaml", "rb") as f:
    read = yaml.load(f.read(), Loader=yaml.FullLoader)
    print(read)

Typical usage (text mode):

import yaml
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]


# Open with an explicit encoding so allow_unicode output is written consistently.
with open("numbers.yaml", "w", encoding="utf-8") as f:
    f.write(yaml.dump(data, allow_unicode=True, indent=4))


with open("numbers.yaml", "r", encoding="utf-8") as f:
    read = yaml.load(f.read(), Loader=yaml.FullLoader)
    print(read)
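
yaml.dump and its variants also accept a stream as the second argument, in which case they write to it directly and f.write is not needed. A minimal sketch, assuming PyYAML is installed and using io.StringIO in place of a real file; it uses safe_dump and safe_load, which are limited to plain data types:

```python
import io

import yaml

data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]

stream = io.StringIO()  # stand-in for open("numbers.yaml", "w")
yaml.safe_dump(data, stream, allow_unicode=True, indent=4)

stream.seek(0)  # rewind before reading back
loaded = yaml.safe_load(stream)

print(loaded == data)
```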

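"Object of type datetime is not JSON serializable": json cannot store datetime objects. The workaround is to store the date as a string and convert that string back into a datetime after reading the file.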

import datetime
datetime.datetime.strptime('2020-12-08 07:47:44', '%Y-%m-%d %H:%M:%S')
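
The full round trip might look like the sketch below. The record and the FORMAT constant are made up for illustration; the format string matches the one used above.

```python
import datetime
import json

FORMAT = '%Y-%m-%d %H:%M:%S'

record = {"date": datetime.datetime(2020, 12, 8, 7, 47, 44)}

# Store the datetime as a formatted string so json can serialize it.
text = json.dumps({"date": record["date"].strftime(FORMAT)})

# After loading, convert the string back into a datetime object.
loaded = json.loads(text)
restored = datetime.datetime.strptime(loaded["date"], FORMAT)

print(restored == record["date"])
```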

XML

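This format is too ugly for my taste. I will use it when the occasion arises, and learn it properly when I get to network programming.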