yaml, json, and xml

Posted 2020-12-08, updated 2021-05-31, category: Notes

Just some basic usage notes.

For a recent crawler assignment (scraping the Zhihu top 50), I saved the data with yaml. The data is a list of dicts, like this:

[
    {
        "top": 34, 
        "url": "https://www.zhihu.com/question/265595875", 
        "question": "考研如何确定学校？", 
        "recommend": 216, 
        "comment": 1, 
        "star": 4695, 
        "watcher": 1573098, 
        "date": "2020-05-22 07:25:17"
    }, 
    {
        "top": 35, 
        "url": "https://www.zhihu.com/question/432129590", 
        "question": "如果女生主动一点，男生会不会心动？", 
        "recommend": null, 
        "comment": "添加评论", 
        "star": 154, 
        "watcher": 175721, 
        "date": "2020-12-08 15:05:46"
    }
]

Saving it with yaml.dump failed:

with open('hot_list.yaml','wb') as f:
    f.write(yaml.dump(hot_list, encoding='utf-8', allow_unicode=True, indent=4))
"""
Traceback (most recent call last):
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 98, in <module>
    get_section(get_url())
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 96, in get_section
    f.write(yaml.dump(data_list, encoding='utf-8', allow_unicode=True, indent=4))
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 290, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 278, in dump_all
    dumper.represent(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 27, in represent
    node = self.represent_data(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  ...
  (many lines of yaml-internal frames omitted)
  ...
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 34, in represent_data
    if self.ignore_aliases(data):
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 139, in ignore_aliases
    if isinstance(data, tuple) and data == ():
RecursionError: maximum recursion depth exceeded while calling a Python object

Process finished with exit code 1
"""

The message says the maximum recursion depth was exceeded. For an ordinary case of deep recursion we could raise the limit by hand:

import sys
sys.setrecursionlimit(1500)  # set the maximum recursion depth to 1500

But this is clearly not a case of legitimately deep recursion. Switching from yaml.dump to yaml.safe_dump() gives a different error:

with open("hot_list.yaml", "w") as f:
    f.write(yaml.safe_dump(data_list, allow_unicode=True, indent=4))
"""
Traceback (most recent call last):
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 95, in <module>
    get_section(get_url())
  File "G:/Code/IntelliJ Pycharm/Crawler/zhihu/zhihu_crawler.py", line 93, in get_section
    f.write(yaml.safe_dump(data_list, allow_unicode=True, indent=4))
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 306, in safe_dump
    return dump_all([data], stream, Dumper=SafeDumper, **kwds)
  File "G:\python environment\lib\site-packages\yaml\__init__.py", line 278, in dump_all
    dumper.represent(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 27, in represent
    node = self.represent_data(data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 199, in represent_list
    return self.represent_sequence('tag:yaml.org,2002:seq', data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 207, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 58, in represent_data
    node = self.yaml_representers[None](self, data)
  File "G:\python environment\lib\site-packages\yaml\representer.py", line 231, in represent_undefined
    raise RepresenterError("cannot represent an object", data)
yaml.representer.RepresenterError: ('cannot represent an object', '2020-12-08 11:03:36')
2020-12-08 07:47:44
"""

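This time the error points at the date value: yaml apparently does not treat it as a plain string (the message shows it in quotes, so it is not a datetime object either). The two errors look unrelated, but both hint that the problem lies in the types of the data.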

Next I saved the data as JSON instead, which worked; the result is the JSON file shown above. The code:

with open('hot_list.json','w',encoding='utf-8') as f:
    f.write(json.dumps(data_list,indent=4,ensure_ascii=False, separators=(', ', ': ')))

Reading the data back from the JSON file into hot_list and then writing hot_list to a yaml file succeeds:

with open('hot_list.json','r',encoding='utf-8') as f:
    hot_list=json.loads(f.read())
print(hot_list)
with open('hot_list.yaml','wb') as f:
    f.write(yaml.dump(hot_list, encoding='utf-8', allow_unicode=True, indent=4))

-   comment: 添加评论
    date: '2020-12-08 15:05:46'
    question: 如果女生主动一点，男生会不会心动？
    recommend: null
    star: 154
    top: 35
    url: https://www.zhihu.com/question/432129590
    watcher: 175721
-   comment: 1
    date: '2020-12-08 13:35:09'
    question: 哈登昨日没有去球队报道，他和火箭队的后续会怎样？
    recommend: 8
    star: 40
    top: 36
    url: https://www.zhihu.com/question/433854967
    watcher: 29397

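In summary, the bug is that something in the data does not meet yaml's requirements. Which part exactly? After a careful look, the only unusual value I could find was None.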

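Since there is no guarantee the crawler actually gets every field, and to make later cleaning with pandas easier, I initialize each dict like this: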

data = {
    'top': None,
    'url': None,
    'question': None,
    'recommend': None,
    'comment': None,
    'star': None,
    'watcher': None,
    'date': None,
}

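So when the crawler fails to get a field, its value stays None, and my first guess was that yaml cannot accept None. That guess does not hold up, though: PyYAML writes None as null, as the "recommend: null" line in the output above shows.

The JSON round trip is the better clue. JSON also accepts None and stores it as null (a JSON literal, not a string), which json.loads turns back into None. More importantly, json.loads returns only built-in types (dict, list, str, int, float, bool, None). yaml.safe_dump raises "cannot represent an object" for instances of str subclasses, and the quoted value in the error message looks like exactly that. A likely cause, assuming the crawler parses pages with BeautifulSoup (its code is not shown here), is that the scraped values were NavigableString objects rather than plain str; wrapping them in str() before dumping should avoid the problem.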
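
The None question can be checked directly. The sketch below assumes PyYAML is installed; TagText is a hypothetical stand-in for whatever str subclass a scraping library might return, not a class from the original crawler.

```python
import yaml


class TagText(str):
    """Hypothetical str subclass, standing in for a scraped text node."""


# None on its own is fine: PyYAML writes it as null.
none_text = yaml.safe_dump({"recommend": None})
print(none_text)

# A str subclass is rejected by safe_dump.
row = {"recommend": None, "date": TagText("2020-12-08 11:03:36")}
error = None
try:
    yaml.safe_dump(row)
except yaml.YAMLError as exc:
    error = exc
    print("safe_dump failed:", exc)

# Converting str-like values to the built-in str makes the row dumpable.
clean = {k: str(v) if isinstance(v, str) else v for k, v in row.items()}
clean_text = yaml.safe_dump(clean)
print(clean_text)
```

If the values really are str subclasses, cleaning them once at scrape time is simpler than the detour through a JSON file.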

json

import json
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]

If data is a str that uses single quotes, swap them for double quotes so it parses as JSON:

data = data.replace("'", '"')
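
Quote swapping only works for simple strings. As a sketch of an alternative using the standard library's ast.literal_eval (the sample string is made up for illustration), a Python-repr string containing None can be parsed directly and then re-serialized as real JSON:

```python
import ast
import json

text = "[{'a': 1, 'note': None}]"  # Python repr, not JSON

# Swapping quotes leaves None in place, which is not a JSON literal.
try:
    json.loads(text.replace("'", '"'))
    swap_ok = True
except json.JSONDecodeError:
    swap_ok = False

# literal_eval understands Python literals, including None.
data = ast.literal_eval(text)
as_json = json.dumps(data)
print(swap_ok, data, as_json)
```

literal_eval only evaluates literals, so it is safe to run on untrusted text in a way that eval is not.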

If data contains backslashes that are not part of a \/, \u, or \" escape, double them so the string stays valid JSON:

import re
data = re.sub(r'\\(?![/u"])', r"\\\\", str(data))
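
To see what that substitution does, here is a small sketch with a made-up Windows path. The raw text contains \d, which is not a legal JSON escape, so json.loads would reject it until the backslashes are doubled:

```python
import json
import re

raw = r'{"path": "C:\dir\file"}'  # lone backslashes: invalid JSON

# Double every backslash that is not followed by /, u, or ".
fixed = re.sub(r'\\(?![/u"])', r"\\\\", raw)
parsed = json.loads(fixed)

print(fixed)
print(parsed["path"])
```

One caveat: the pattern also doubles backslashes that were already correctly escaped (for example \\ or \n), so it is only suitable for text known to contain unescaped backslashes.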

If data contains Chinese text, open the file as utf-8 and pass ensure_ascii=False:

with open('numbers.json','w',encoding='utf-8') as f:
    f.write(json.dumps(data,indent=4,ensure_ascii=False, separators=(', ', ': ')))
with open('numbers.json','r',encoding='utf-8') as f:
    read=json.loads(f.read())
print(read)
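
A short sketch of what ensure_ascii changes, using one question from the data above. Both forms load back to the same object; the difference is only in how readable the file is:

```python
import json

row = {"question": "考研如何确定学校？"}

escaped = json.dumps(row)                       # default: non-ASCII becomes \uXXXX
readable = json.dumps(row, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)
```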

dumps returns a string, so write it with f.write(json.dumps(data)):

import json
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]
with open('numbers.json','w') as f:
    # dumps returns a str, i.e. it converts the Python object to JSON text
    f.write(json.dumps(data, sort_keys=True,indent=4, separators=(', ', ': ')))
with open('numbers.json', 'r') as f:
    read=json.loads(f.read())
print(read)

dump takes a file object to write to, so use json.dump(data, fp=f):

import json
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]
with open('numbers2.json','w') as f:
    # dump needs a file-like object, so it is used together with an open file
    json.dump(data, fp=f, sort_keys=True, indent=4, separators=(', ', ': '))

Reading from a file

with open('numbers.json', 'r') as f:
    read=json.load(f)
print(read)
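
The relationship between the four functions can be sketched without touching the disk. Here io.StringIO is an in-memory stand-in for an open text file, chosen only to keep the example self-contained:

```python
import io
import json

data = [{"a": 1, "b": 2}]

buffer = io.StringIO()             # behaves like a file opened in text mode
json.dump(data, buffer, indent=4)  # dump writes into the file object
same_text = buffer.getvalue() == json.dumps(data, indent=4)

buffer.seek(0)                     # rewind before reading
restored = json.load(buffer)       # load reads from the file object
print(same_text, restored)
```

In short, dump and load work on file objects, while dumps and loads work on strings.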

yaml

yaml has no dumps and loads functions. Called without a stream, yaml.dump returns the serialized text, so use f.write(yaml.dump(data)).

To write in binary mode, pass encoding so that yaml.dump returns bytes; use utf-8 for Chinese text:

import yaml
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]


with open("numberb.yaml", "wb") as f:
    # binary read/write
    # when data contains Chinese characters, add encoding='utf-8', allow_unicode=True
    f.write(yaml.dump(data, encoding='utf-8', allow_unicode=True, indent=4))
    
    
with open("numberb.yaml", "rb") as f:
    read = yaml.load(f.read(), Loader=yaml.FullLoader)
print(read)

Typical usage

import yaml
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]


with open("numbers.yaml","w") as f:
    f.write(yaml.dump(data, allow_unicode=True, indent=4))
    
    
with open("numbers.yaml","r") as f:
    read=yaml.load(f.read(), Loader=yaml.FullLoader)
print(read)
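
As a variation, yaml can also write to and read from a stream directly, much like json.dump and json.load. This sketch assumes PyYAML is installed and uses io.StringIO in place of a real file; safe_dump and safe_load are used because the data holds only built-in types:

```python
import io
import yaml

data = [{"question": "考研如何确定学校？", "recommend": None}]

buffer = io.StringIO()  # stand-in for open("numbers.yaml", "w", encoding="utf-8")
yaml.safe_dump(data, buffer, allow_unicode=True, indent=4)

buffer.seek(0)
restored = yaml.safe_load(buffer)
print(buffer.getvalue())
print(restored)
```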

Object of type datetime is not JSON serializable: json cannot save datetime objects. Save the value as a string instead, and convert the string back to a datetime after reading the file.

import datetime
datetime.datetime.strptime('2020-12-08 07:47:44','%Y-%m-%d %H:%M:%S')
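
The full round trip might look like the sketch below. The helper encode_datetime and the constant FMT are names introduced here for illustration; json.dumps calls the function passed as default for any object it cannot serialize itself:

```python
import json
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
row = {"top": 34, "date": datetime(2020, 5, 22, 7, 25, 17)}


def encode_datetime(value):
    """Turn datetime objects into strings; reject anything else."""
    if isinstance(value, datetime):
        return value.strftime(FMT)
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


text = json.dumps(row, default=encode_datetime)

# After loading, the date is a plain string and has to be parsed again.
loaded = json.loads(text)
loaded["date"] = datetime.strptime(loaded["date"], FMT)
print(text)
print(loaded)
```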

XML

This format is too ugly. I'll use it when the occasion comes up, and learn it properly when I get to network programming.