I think you could try parsing the YAML and loading it into a DataFrame by normalizing it:
import pandas as pd
from yaml import safe_load

# Parse the YAML file and flatten nested dicts into dotted column names
with open('legislators-historical.yaml', 'r') as f:
    df = pd.json_normalize(safe_load(f))

print(df.head())
Output:
  bio.birthday bio.gender bio.religion id.bioguide       id.fec  id.govtrack  \
0   1943-12-02          M   Protestant     A000109  [S6CO00168]       300003
1   1745-04-02          M          NaN     B000226          NaN       401222
2   1742-03-21          M          NaN     B000546          NaN       401521
3   1743-06-16          M          NaN     B001086          NaN       402032
4   1730-07-22          M          NaN     C000187          NaN       402334

  id.house_history id.icpsr id.lis id.opensecrets id.thomas id.votesmart  \
0             8410    29108   S250      N00009082     00011        26783
1              NaN      507    NaN            NaN       NaN          NaN
2             9479      786    NaN            NaN       NaN          NaN
3            10177     1260    NaN            NaN       NaN          NaN
4            10687     1538    NaN            NaN       NaN          NaN

     id.wikipedia  name.first name.last name.middle  \
0    Wayne Allard       Wayne    Allard          A.
1             NaN     Richard   Bassett         NaN
2             NaN  Theodorick     Bland         NaN
3   Aedanus Burke     Aedanus     Burke         NaN
4  Daniel Carroll      Daniel   Carroll         NaN

                                               terms
0  [{'party': 'Republican', 'type': 'rep', 'state...
1  [{'party': 'Anti-Administration', 'type': 'sen...
2  [{'end': '1791-03-03', 'district': 9, 'type': ...
3  [{'end': '1791-03-03', 'district': 2, 'type': ...
4  [{'end': '1791-03-03', 'district': 6, 'type': ...
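To see what the flattening does without the full file, here is a minimal sketch on a single in-memory record. The record is a hand-written, abridged stand-in for one YAML entry (an assumption, not the real file contents). It suggests that nested dicts become dotted column names, while list values such as `terms` stay as lists in a single cell:

```python
import pandas as pd

# Hypothetical, abridged record shaped like one entry of the YAML file.
records = [
    {
        "id": {"bioguide": "A000109", "thomas": "00011", "fec": ["S6CO00168"]},
        "name": {"first": "Wayne", "last": "Allard"},
        "terms": [
            {"type": "rep", "start": "1991-01-03", "end": "1993-01-03", "state": "CO"},
            {"type": "sen", "start": "1997-01-07", "end": "2003-01-03", "state": "CO"},
        ],
    }
]

# Nested dicts are flattened into "parent.child" columns; lists are left untouched.
flat = pd.json_normalize(records)
columns = sorted(flat.columns)
print(columns)
print(type(flat.loc[0, "terms"]))
```

This is why the `terms` column above still holds a list of dicts per row and needs a second step to be unpacked.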
UPDATE:
The following version filters your input data so that only records containing both "thomas" and "fec" IDs are processed:
import pandas as pd
from yaml import safe_load


def read_yaml(fn):
    with open(fn, 'r') as fi:
        return safe_load(fi)


def filter_data(data):
    # Keep only records that have both an 'fec' and a 'thomas' ID
    result_data = []
    for x in data:
        if 'id' not in x:
            continue
        if 'fec' not in x['id']:
            continue
        if 'thomas' not in x['id']:
            continue
        result_data.append(x)
    return result_data


fn = 'aaa.yaml'
# One row per entry in 'terms'; the two IDs are carried along as metadata columns
df = pd.json_normalize(filter_data(read_yaml(fn)), 'terms', [['id', 'fec'], ['id', 'thomas']])
print(df.head())
df.to_csv('out.csv')
Output:
  class district         end       party       start state type  \
0   NaN        4  1993-01-03  Republican  1991-01-03    CO  rep
1   NaN        4  1995-01-03  Republican  1993-01-05    CO  rep
2   NaN        4  1997-01-03  Republican  1995-01-04    CO  rep
3     2      NaN  2003-01-03  Republican  1997-01-07    CO  sen
4     2      NaN  2009-01-03  Republican  2003-01-07    CO  sen

                        url id.thomas     id.fec
0                       NaN     00011  S6CO00168
1                       NaN     00011  S6CO00168
2                       NaN     00011  S6CO00168
3                       NaN     00011  S6CO00168
4  http://allard.senate.gov     00011  S6CO00168
PS: as you can see, this repeats the ID values on every row (see id.thomas and id.fec) so that the data can be represented as a flat DataFrame.
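The row duplication can be reproduced on a tiny in-memory example. This is only a sketch: both records are made up for illustration (the second one is entirely hypothetical), and it uses scalar IDs (`bioguide` and `thomas`) as metadata to keep the result easy to inspect:

```python
import pandas as pd

# Hypothetical records: the first has two terms, the second has one.
records = [
    {
        "id": {"bioguide": "A000109", "thomas": "00011"},
        "terms": [
            {"type": "rep", "start": "1991-01-03", "state": "CO"},
            {"type": "sen", "start": "1997-01-07", "state": "CO"},
        ],
    },
    {
        "id": {"bioguide": "X000000", "thomas": "00000"},  # placeholder IDs
        "terms": [{"type": "rep", "start": "1789-03-04", "state": "VA"}],
    },
]

# record_path picks the list to explode into rows;
# meta lists the parent fields to repeat on each of those rows.
terms_df = pd.json_normalize(
    records,
    record_path="terms",
    meta=[["id", "bioguide"], ["id", "thomas"]],
)
print(terms_df)
```

Each term becomes its own row, so a legislator with two terms shows up twice with the same `id.thomas` value.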
UPDATE2:
You might also want to expand the lists in "id.fec" into separate columns, but I would do that in an additional DataFrame:
df_fec = df['id.fec'].apply(pd.Series)
print(df_fec.head())
Output:
           0          1
0  S8AR00112  H2AR01022
1  S8AR00112  H2AR01022
2  S8AR00112  H2AR01022
3  S8AR00112  H2AR01022
4  S6CO00168        NaN
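If you later want those columns back next to the rest of the data, one possible approach (a sketch, not the only way) is to build the extra DataFrame from the lists and join it on the index. The small `df` below is a made-up stand-in for the real one, the first `id.thomas` value is a placeholder, and the `fec_` prefix is just an illustrative naming choice:

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; 'id.fec' holds lists of varying length.
df = pd.DataFrame(
    {
        "id.thomas": ["00000", "00011"],  # first value is a placeholder
        "id.fec": [["S8AR00112", "H2AR01022"], ["S6CO00168"]],
    }
)

# One column per list position; shorter lists are padded with missing values.
df_fec = pd.DataFrame(df["id.fec"].tolist(), index=df.index).add_prefix("fec_")

# Replace the list column with the expanded columns.
wide = df.drop(columns="id.fec").join(df_fec)
print(wide)
```

Building the frame from `tolist()` tends to be faster than `apply(pd.Series)` on large inputs, at the cost of assuming every cell really is a list.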