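I'm trying to load a very confusing, multiply nested JSON into pandas. I'm already using json_normalize (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html), but figuring out how to join two similar nested dicts, as well as unpack their sub-dicts and lists, has been stumping me. My pandas knowledge is limited, but I assume that if I can get this down I can take advantage of its performance.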
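I have 2 dicts containing war data, one loaded from a JSON API response and the other from a database. I'm trying to compare the two for new attacks and defenses.
Example war: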
{
  "state": "active",
  "team_size": 20,
  "teams": {
    "id": "12345679",
    "name": "Good Guys",
    "level": 10,
    "attacks": 4,
    "destruction_percentage": 22.6,
    "members": [
      {
        "id": "1",
        "name": "John",
        "level": 12
      },
      {
        "id": "2",
        "name": "Tom",
        "level": 11,
        "attacks": [
          {
            "attackerTag": "2",
            "defenderTag": "4",
            "damage": 64,
            "order": 7
          }
        ]
      }
    ]
  },
  "opponent": {
    "id": "987654321",
    "name": "Bad Guys",
    "level": 17,
    "attacks": 5,
    "damage": 20.95,
    "members": [
      {
        "id": "3",
        "name": "Betty",
        "level": 17,
        "attacks": [
          {
            "attacker_id": "3",
            "defender_id": "1",
            "damage": 70,
            "order": 1
          },
          {
            "attacker_id": "3",
            "defender_id": "7",
            "damage": 100,
            "order": 11
          }
        ],
        "opponentAttacks": 0,
        "some_useless_data": "Want to ignore, this doesn't show in every record"
      },
      {
        "id": "4",
        "name": "Fred",
        "level": 9,
        "attacks": [
          {
            "attacker_id": "4",
            "defender_id": "9",
            "damage": 70,
            "order": 4
          }
        ],
        "opponentAttacks": 0
      }
    ]
  }
}
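Now, I'm assuming pandas would be my best option performance-wise, rather than zipping them together, looping through each member, and comparing them.
So my attempt at getting a dataframe that is flat and easy to traverse has been difficult, to say the least. Ideally I'd assume the following layout. I'm just trying to get both teams into a single df of only all members. We can omit the state and team_size keys and focus on getting each member with their respective attacks and team_id's.
Example df (expected result):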
id  name   level  attacks         member.team_id  ...
1   John   12     NaN             "123456789"
2   Tom    11     [{...}]         "123456789"
3   Betty  17     [{...}, {...}]  "987654321"
4   Fred   9      [{...}]         "987654321"
That's the basic gist of what I'd like as a df, so that I can then take the two dataframes and compare them for new attacks.
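To make the target layout concrete, here is a minimal sketch of one way it could be built. It is only an illustration under assumptions: the war dict is a trimmed-down, hypothetical stand-in for the example above, the helper name members_frame and the column name team_id are made up for this sketch, and it builds each side with pd.DataFrame over the list of member dicts rather than json_normalize.

```python
import pandas as pd

# Hypothetical, trimmed-down war dict with the same shape as the example above.
war = {
    "teams": {
        "id": "12345679",
        "members": [
            {"id": "1", "name": "John", "level": 12},
            {"id": "2", "name": "Tom", "level": 11,
             "attacks": [{"damage": 64, "order": 7}]},
        ],
    },
    "opponent": {
        "id": "987654321",
        "members": [
            {"id": "3", "name": "Betty", "level": 17,
             "attacks": [{"damage": 70, "order": 1}],
             "some_useless_data": "ignored"},
        ],
    },
}

# Only these member keys are kept; anything else (e.g. some_useless_data) is dropped.
COLUMNS = ["id", "name", "level", "attacks"]


def members_frame(war):
    """Return one row per member of both sides, tagged with the team id."""
    frames = []
    for side in ("teams", "opponent"):
        team = war[side]
        # reindex keeps the wanted columns and fills missing ones with NaN,
        # so a member without an "attacks" key gets NaN in that column.
        df = pd.DataFrame(team["members"]).reindex(columns=COLUMNS)
        df["team_id"] = team["id"]
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


members = members_frame(war)
print(members)
```

The attacks column still holds plain Python lists of dicts here, so comparing two such frames for new attacks would need a further step to unpack those lists.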
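Note I just pop()'d state and team_size from the dict before my attempts, since all I want is every member, with the team pretty much baked into each row.
I have tried the following with no luck, and I know it isn't the right way because it works backwards up the dict tree.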
old_df = json_normalize(war,
                        'members',
                        ['id', 'name', 'level', 'attacks'],
                        record_prefix='member')
#Traceback (most recent call last):
# File "test.py", line 83, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 79, in new_parse
# record_prefix='member')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 262, in json_normalize
# _recursive_extract(data, record_path, {}, level=0)
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 238, in _recursive_extract
# recs = _pull_field(obj, path[0])
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 185, in _pull_field
# result = result[spec]
#KeyError: 'members'
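For what it's worth, the KeyError seems to come from 'members' not being a top-level key of war: it lives one level down, under teams and opponent. A minimal sketch of passing the nested path as a list, assuming pandas 1.0 or later (where the function is exposed as pd.json_normalize) and a trimmed-down, hypothetical stand-in for the war dict:

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for the war dict shown earlier.
war = {
    "teams": {
        "id": "12345679",
        "members": [
            {"id": "1", "name": "John", "level": 12},
            {"id": "2", "name": "Tom", "level": 11,
             "attacks": [{"damage": 64, "order": 7}]},
        ],
    },
}

# record_path accepts a list of keys, walked from the top of the dict down
# to the list of records: war["teams"]["members"].
team_members = pd.json_normalize(war, record_path=["teams", "members"])
print(team_members)
```

This only covers one side of the war; the opponent side would need its own call with record_path=["opponent", "members"].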
I figured I could use something like the following, but that isn't working either.
df = pd.DataFrame.from_dict(old, orient='index')
df.droplevel('members')
#Traceback (most recent call last):
# File "test.py", line 106, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 87, in new_parse
# df.droplevel('members')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4376, in __getattr__
# return object.__getattribute__(self, name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'
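I appreciate any guidance! Hopefully I've put in enough effort to help you understand my expected result; if not, please let me know!
Edit: In all fairness, I do know how to do this by just looping through the dict and creating a new list of members with the appropriate data, but I feel that is far less efficient than using pandas, since I'm doing this for millions of wars in a threaded application, and every bit of performance I can get out of it is a bonus for me and the app. Thanks again!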
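For comparison, the plain-loop approach mentioned in the edit could look roughly like the sketch below. It rests on an assumption that is not confirmed by the data: that the order value uniquely identifies an attack within a war. The names iter_attacks and new_attacks, and the two small war dicts, are hypothetical.

```python
SIDES = ("teams", "opponent")


def iter_attacks(war):
    """Yield every attack dict made by any member on either side of a war."""
    for side in SIDES:
        for member in war[side]["members"]:
            # Members who have not attacked may lack the "attacks" key entirely.
            yield from member.get("attacks", [])


def new_attacks(old_war, new_war):
    """Return attacks in new_war whose "order" does not appear in old_war.

    Assumes "order" is unique per attack within a war.
    """
    seen = {attack["order"] for attack in iter_attacks(old_war)}
    return [attack for attack in iter_attacks(new_war)
            if attack["order"] not in seen]


# Hypothetical, trimmed-down wars: the new one has one extra attack.
old_war = {
    "teams": {"members": [{"id": "1"}]},
    "opponent": {"members": [
        {"id": "3", "attacks": [{"order": 1, "damage": 70}]},
    ]},
}
new_war = {
    "teams": {"members": [{"id": "1"}]},
    "opponent": {"members": [
        {"id": "3", "attacks": [{"order": 1, "damage": 70},
                                {"order": 11, "damage": 100}]},
    ]},
}

fresh = new_attacks(old_war, new_war)
print(fresh)
```

This is the baseline I'd be measuring any pandas version against.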