Use the DataFrame constructor with str.get_dummies (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.get_dummies.html):
import pandas as pd

L = [['1', 'Toy Story (1995)', "Animation|Children's|Comedy"],
     ['2', 'Jumanji (1995)', "Adventure|Children's|Fantasy"],
     ['3', 'Grumpier Old Men (1995)', 'Comedy|Romance']]
df = pd.DataFrame(L, columns=['MovieID','Name','Data'])
# one 0/1 indicator column per genre; '|' is the default separator
df1 = df['Data'].str.get_dummies()
print (df1)
   Adventure  Animation  Children's  Comedy  Fantasy  Romance
0          0          1           1       1        0        0
1          1          0           1       0        1        0
2          0          0           0       1        0        1
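No separator is passed above because str.get_dummies splits on '|' by default. As a minimal sketch, assuming hypothetical data delimited by ';' instead, the separator would be given through the sep parameter:

```python
import pandas as pd

# Hypothetical sample data (not from the question); genres are ';'-delimited.
genres = pd.Series(["Animation;Comedy", "Adventure", "Comedy;Romance"])

# sep defaults to '|', so any other delimiter has to be passed explicitly.
dummies = genres.str.get_dummies(sep=";")
print(dummies)
```

The resulting indicator columns come back in sorted order, as in the output above.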
For the columns Name and Year, you need split (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) and rstrip (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.rstrip.html) to remove the trailing ); Year is then converted to ints.
# raw string avoids an invalid-escape warning; the pattern splits on whitespace followed by "("
df[['Name','Year']] = df['Name'].str.split(r'\s\(', expand=True)
df['Year'] = df['Year'].str.rstrip(')').astype(int)
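One caveat: if a title itself contains parentheses, splitting on every " (" yields more than two columns, and the two-column assignment no longer fits. A hedged alternative, not part of the original answer, is str.extract with a pattern anchored at the end of the string. The sample titles and the group names Name and Year below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical titles; the second one has parentheses inside the name itself.
names = pd.Series(["Toy Story (1995)", "Shanghai Triad (Yao a yao) (1995)"])

# Capture everything before the final "(YYYY)" as Name and the four digits as Year.
parts = names.str.extract(r"^(?P<Name>.*)\s\((?P<Year>\d{4})\)$")
parts["Year"] = parts["Year"].astype(int)
print(parts)
```

Because the greedy `.*` runs up to the last " (YYYY)" in the string, only the trailing year is peeled off.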
Finally, drop the column Data and add df1 to the original with join (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html):
df = df.drop('Data', axis=1).join(df1)
print (df)
  MovieID              Name  Year  Adventure  Animation  Children's  Comedy  \
0       1         Toy Story  1995          0          1           1       1
1       2           Jumanji  1995          1          0           1       0
2       3  Grumpier Old Men  1995          0          0           0       1

   Fantasy  Romance
0        0        0
1        1        0
2        0        1
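join matches rows by index label rather than by position, which is why df1 lines up with df here: both carry the same default index. A small sketch with made-up frames (the names left and right and the index labels are assumptions) shows the alignment:

```python
import pandas as pd

# Hypothetical frames with the same index labels in a different order.
left = pd.DataFrame({"Name": ["Toy Story", "Jumanji"]}, index=[10, 20])
right = pd.DataFrame({"Comedy": [0, 1]}, index=[20, 10])

# join aligns on the index, so each row picks up the value with its own label.
joined = left.join(right)
print(joined)
```

If the frame had been filtered or re-indexed before the join, the rows would therefore still pair up correctly as long as the labels match.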