熊猫 >= 0.16
pd.Series.mode
可用!
Use groupby, GroupBy.agg,并应用pd.Series.mode每个组的功能:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Name: Short name, dtype: object
如果需要将其作为 DataFrame,请使用
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()
Short name
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
有用的事情Series.mode
是它总是返回一个系列,这使得它非常兼容agg
and apply
,尤其是在重建 groupby 输出时。它也更快。
# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
处理多种模式
Series.mode
当有时也做得很好multiple modes:
source2 = source.append(
pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
ignore_index=True)
# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2
Country City Short name
0 USA New-York NY
1 USA New-York New
2 Russia Sankt-Petersburg Spb
3 USA New-York NY
4 USA New-York New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
Country City
Russia Sankt-Petersburg Spb
USA New-York [NY, New]
Name: Short name, dtype: object
或者,如果您想要每种模式单独一行,您可以使用GroupBy.apply:
source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)
Country City
Russia Sankt-Petersburg 0 Spb
USA New-York 0 NY
1 New
Name: Short name, dtype: object
If you 不在乎只要它是其中之一,就会返回哪种模式,那么您将需要一个调用的 lambdamode
并提取第一个结果。
source2.groupby(['Country','City'])['Short name'].agg(
lambda x: pd.Series.mode(x)[0])
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Name: Short name, dtype: object
考虑(不)考虑的替代方案
您还可以使用statistics.mode来自 python,但是...
source.groupby(['Country','City'])['Short name'].apply(statistics.mode)
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Name: Short name, dtype: object
...当必须处理多种模式时,它不能很好地工作; AStatisticsError
被提出。文档中提到了这一点:
如果数据为空,或者不存在最常见的值,
引发统计错误。
但你可以亲自看看...
statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values