Why does a pandas transform fail if there is only a single column

2024-03-05

After reading this question https://stackoverflow.com/questions/19265942/pandas-create-a-new-column-filled-with-the-number-of-observations-in-another-co I messed around a bit and found this:

import pandas as pd

df = pd.DataFrame({'a':[1,1,1,1,2,2,3,3,3,4,4,4,4,4,4,4]})
df['num_totals'] = df.groupby('a').transform('count')

This gives a ValueError:

ValueError                                Traceback (most recent call last)
<ipython-input-38-157c6339ad93> in <module>()
      3 #df = pd.DataFrame({'a':[1,1,1,1,2,2,3,3,3,4,4,4,4,4,4,4], 'b':[1,1,1,1,2,2,3,3,3,4,4,4,4,4,4,4]})
      4 df = pd.DataFrame({'a':[1,1,1,1,2,2,3,3,3,4,4,4,4,4,4,4]})
----> 5 df['num_totals'] = df.groupby('a').transform('count')
      6 
      7 #df['num_totals']=df.groupby('a')[['a']].transform('count')

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
   2117         else:
   2118             # set column
-> 2119             self._set_item(key, value)
   2120 
   2121     def _setitem_slice(self, key, value):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
   2164         """
   2165         value = self._sanitize_column(key, value)
-> 2166         NDFrame._set_item(self, key, value)
   2167 
   2168     def insert(self, loc, column, value, allow_duplicates=False):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\generic.pyc in _set_item(self, key, value)
    677 
    678     def _set_item(self, key, value):
--> 679         self._data.set(key, value)
    680         self._clear_item_cache()
    681 

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\internals.pyc in set(self, item, value)
   1779         except KeyError:
   1780             # insert at end
-> 1781             self.insert(len(self.items), item, value)
   1782 
   1783         self._known_consolidated = False

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\internals.pyc in insert(self, loc, item, value, allow_duplicates)
   1793 
   1794             # new block
-> 1795             self._add_new_block(item, value, loc=loc)
   1796 
   1797         except:

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\internals.pyc in _add_new_block(self, item, value, loc)
   1909             loc = self.items.get_loc(item)
   1910         new_block = make_block(value, self.items[loc:loc + 1].copy(),
-> 1911                                self.items, fastpath=True)
   1912         self.blocks.append(new_block)
   1913 

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\internals.pyc in make_block(values, items, ref_items, klass, fastpath, placement)
    964             klass = ObjectBlock
    965 
--> 966     return klass(values, items, ref_items, ndim=values.ndim, fastpath=fastpath, placement=placement)
    967 
    968 # TODO: flexible with index=None and/or items=None

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\internals.pyc in __init__(self, values, items, ref_items, ndim, fastpath, placement)
     42         if len(items) != len(values):
     43             raise ValueError('Wrong number of items passed %d, indices imply %d'
---> 44                              % (len(items), len(values)))
     45 
     46         self.set_ref_locs(placement)

ValueError: Wrong number of items passed 1, indices imply 0

But if I have 2 columns, then it works fine:

df = pd.DataFrame({'a':[1,1,1,1,2,2,3,3,3,4,4,4,4,4,4,4],'b':[1,1,1,1,2,2,3,3,3,4,4,4,4,4,4,4]})
df['num_totals'] = df.groupby('a').transform('count')
df



Out[40]:
    a  b  num_totals
0   1  1           4
1   1  1           4
2   1  1           4
3   1  1           4
4   2  2           2
5   2  2           2
6   3  3           3
7   3  3           3
8   3  3           3
9   4  4           7
10  4  4           7
11  4  4           7
12  4  4           7
13  4  4           7
14  4  4           7
15  4  4           7

Or if I do this with the single-column df:

df['num_totals']=df.groupby('a')[['a']].transform('count')

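There is a similar SO post https://stackoverflow.com/questions/13854476/pandas-transform-doesnt-work-sorting-groupby-output but it is not clear to me why a Series should fail and a DataFrame should work in the example above, and why having 2 or more columns works.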

I am using Python 2.7 64-bit and pandas 0.12.

Single column in the DF

As you noted above, this returns a Series of the same size as the original:

In [32]: df.groupby('a')['a'].transform('count')
Out[32]: 
0     4
1     4
2     4
3     4
4     2
5     2
6     3
7     3
8     3
9     7
10    7
11    7
12    7
13    7
14    7
15    7
Name: a, dtype: int64

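However, this returns an empty frame, because the only column, a, is the grouping key and is left out of the transformed result: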

In [33]: df.groupby('a').transform('count')
Out[33]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

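You cannot assign an empty frame as a column of another frame, because that assignment is inherently ambiguous (although you could argue that it should "work").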
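One way to sidestep the ambiguity is to select the column explicitly, so that transform always returns a Series. The following is a minimal sketch rather than an official pandas recipe; the helper name assign_group_count is hypothetical and was chosen only for this illustration.

```python
import pandas as pd


def assign_group_count(df, key, new_col):
    """Hypothetical helper: return a copy of df with per-group row counts in new_col."""
    # Selecting the key column explicitly gives a SeriesGroupBy, so the
    # result is a Series no matter how many other columns df has.
    counts = df.groupby(key)[key].transform('count')
    out = df.copy()
    out[new_col] = counts
    return out


df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4]})
result = assign_group_count(df, 'a', 'num_totals')
print(result)
```

Because the helper never relies on a whole-frame transform, it behaves the same for the single-column frame and for frames with more columns.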

Two columns in the starting DF

The two-column case (df2 here is the two-column frame from the question) returns a single-column DataFrame:

In [42]: df2.groupby('a').transform('count')
Out[42]: 
    b
0   4
1   4
2   4
3   4
4   2
5   2
6   3
7   3
8   3
9   7
10  7
11  7
12  7
13  7
14  7
15  7

In [43]: type(df2.groupby('a').transform('count'))
Out[43]: pandas.core.frame.DataFrame

Or a series

In [45]: df2.groupby('a')['a'].transform('count')
Out[45]: 
0     4
1     4
2     4
3     4
4     2
5     2
6     3
7     3
8     3
9     7
10    7
11    7
12    7
13    7
14    7
15    7
Name: a, dtype: int64

In [46]: type(df.groupby('a')['a'].transform('count'))
Out[46]: pandas.core.series.Series

This "works" because pandas does allow assigning a single-column frame: it takes the underlying Series.
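The implicit step can also be written out by hand. This is only a sketch of the idea, assuming df2 is built like the two-column frame in the question; the explicit column-count check is an addition for illustration and is not something pandas requires.

```python
import pandas as pd

values = [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4]
df2 = pd.DataFrame({'a': values, 'b': values})

# The grouping key 'a' is excluded, so only column 'b' is transformed.
counts_frame = df2.groupby('a').transform('count')

# Illustrative guard: fail with a clear message unless exactly one column came back.
if counts_frame.shape[1] != 1:
    raise ValueError('expected exactly one column, got %d' % counts_frame.shape[1])

# Take the underlying Series explicitly instead of relying on pandas to do it.
df2['num_totals'] = counts_frame.iloc[:, 0]
print(df2)
```

With the single-column frame, the guard would raise a message that names the real problem (no columns came back) instead of the block-level error shown in the traceback.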

So pandas is actually trying to be helpful. That said, I do find this an unclear error message for an attempt to assign an empty frame.

