I like to use sklearn.preprocessing.LabelEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) for the letter-to-digit conversion:
from sklearn.preprocessing import LabelEncoder
# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)
# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
The resulting output:
   ID_0  ID_1  count
0     0     2      2
1     1     3      1
2     2     0      3
3     3     4      1
If you want to convert back to letters later on, you can use le.inverse_transform:
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.inverse_transform)
It maps back as expected:
  ID_0 ID_1  count
0    a    c      2
1    b    f      1
2    c    a      3
3    f    g      1
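For anyone who wants to run the round trip end to end, here is a minimal self-contained sketch. The input frame is hypothetical: the original data is not shown here, so the rows below were chosen only so that the group counts match the output above. It also uses `.to_numpy().ravel()` instead of `.values.flat` when fitting, which is my assumption for a plain 1-D input rather than something the steps above require.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical input, built only so the group counts match the output above.
df = pd.DataFrame({
    'ID_0': ['a', 'a', 'b', 'c', 'c', 'c', 'f'],
    'ID_1': ['c', 'c', 'f', 'a', 'a', 'a', 'g'],
})

# Count each (ID_0, ID_1) pair while the IDs are still letters.
counts = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()

# Fit one encoder on the letters from both columns so they share a mapping.
le = LabelEncoder()
le.fit(counts[['ID_0', 'ID_1']].to_numpy().ravel())

# Letters -> digits.
encoded = counts.copy()
encoded[['ID_0', 'ID_1']] = encoded[['ID_0', 'ID_1']].apply(le.transform)

# Digits -> letters again.
decoded = encoded.copy()
decoded[['ID_0', 'ID_1']] = decoded[['ID_0', 'ID_1']].apply(le.inverse_transform)

print(encoded)
print(decoded)
```

Fitting a single encoder on both columns is what keeps the mapping consistent: the same letter gets the same digit whether it appears in ID_0 or ID_1.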
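If you just want to know which digit corresponds to which letter, you can look at the le.classes_ attribute. This gives you an array of letters, indexed by the digit each letter is encoded to: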
le.classes_
['a' 'b' 'c' 'f' 'g']
For a more visual representation, you can cast it as a Series:
pd.Series(le.classes_)
0 a
1 b
2 c
3 f
4 g
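If an explicit lookup table is handier than an array, the same information can be turned into plain dictionaries. This is only a sketch: the `letters` array is hypothetical sample data picked to give the classes shown above, and the dictionary names are mine.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample letters, chosen to give the classes shown above.
letters = np.array(['a', 'c', 'b', 'f', 'c', 'a', 'f', 'g'])

le = LabelEncoder()
le.fit(letters)

# classes_ is sorted, so the letter at position i is the one encoded as i.
letter_to_digit = {str(letter): digit for digit, letter in enumerate(le.classes_)}
digit_to_letter = {digit: str(letter) for digit, letter in enumerate(le.classes_)}

print(letter_to_digit)
print(digit_to_letter)
```

The dictionaries agree with `le.transform` and `le.inverse_transform`, so they are a convenience for inspection rather than a different encoding.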
Timings
Using a larger version of the sample data and the following setup:
import numpy as np
import pandas as pd

# df here is the original sample data (letters, before the groupby).
df2 = pd.concat([df]*10**5, ignore_index=True)

def root(df):
    df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
    le = LabelEncoder()
    le.fit(df[['ID_0', 'ID_1']].values.flat)
    df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
    return df

def pir2(df):
    unq = np.unique(df)
    mapping = pd.Series(np.arange(unq.size), unq)
    return df.stack().map(mapping).unstack() \
        .groupby(df.columns.tolist()).size().reset_index(name='count')
I get the following timings:
%timeit root(df2)
10 loops, best of 3: 101 ms per loop
%timeit pir2(df2)
1 loops, best of 3: 1.69 s per loop
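%timeit is an IPython magic, so here is a rough sketch of how the same comparison could be run from a plain script with the standard timeit module. Several things in it are assumptions: the sample frame is hypothetical, the data is only repeated 1000 times to keep the run short, and root fits on `.to_numpy().ravel()` rather than `.values.flat`. It also checks that both functions return the same table before timing them. The numbers will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def root(df):
    # Group first, then encode only the (few) unique pairs.
    df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
    le = LabelEncoder()
    le.fit(df[['ID_0', 'ID_1']].to_numpy().ravel())
    df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
    return df


def pir2(df):
    # Encode every row first, then group.
    unq = np.unique(df)
    mapping = pd.Series(np.arange(unq.size), unq)
    return df.stack().map(mapping).unstack() \
        .groupby(df.columns.tolist()).size().reset_index(name='count')


# Hypothetical sample data, repeated to get a larger frame.
sample = pd.DataFrame({
    'ID_0': ['a', 'a', 'b', 'c', 'c', 'c', 'f'],
    'ID_1': ['c', 'c', 'f', 'a', 'a', 'a', 'g'],
})
big = pd.concat([sample] * 1000, ignore_index=True)

# Both approaches should produce the same table before we compare speed.
same_result = root(big).to_numpy().tolist() == pir2(big).to_numpy().tolist()
print('same result:', same_result)

# Best of 3 single runs for each function, in seconds.
best_root = min(timeit.repeat(lambda: root(big), number=1, repeat=3))
best_pir2 = min(timeit.repeat(lambda: pir2(big), number=1, repeat=3))
print(f'root: {best_root:.4f} s, pir2: {best_pir2:.4f} s')
```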