Data processing with dynamically added columns in a Python Pandas DataFrame

2024-03-18

I have the following problem. Say this is my CSV:

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

So I have rows that can be grouped by id. I want to create a CSV like the one below as output.

f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4  5  5   3   1    0    7      4      4      1   4     6

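So I want to be able to choose the number of rows I grab to convert into columns (always starting from the first row of an id). In this case I am grabbing 3 rows. I will then also skip one or more rows (in this case just one) and take the final columns from the last row of the same id group. For certain reasons, I want to use a data frame.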

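After struggling for 3-4 hours, I found the solution shown below. But my solution is very slow. I have about 700,000 rows and maybe around 70,000 groups of ids. The code below with model=3 takes almost an hour on my 4GB, 4-core Lenovo. I need to go to model = maybe 10 or 15. I am still a novice in Python, and I am sure several changes could make this fast. Can someone explain in depth how I can improve the code?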

Thanks a ton.

model: the number of rows to grab

import pandas as pd

# train data frame from reading the csv
train = pd.read_csv(filename)

# Get groups of rows with same id
csv_by_id = train.groupby('id')

# Lists (not sets) so that the column order is deterministic
modelTarget = ['f1_t', 'f2_t', 'f3_t']

# modelFeatures is a list of features I am interested in the csv.
# The csv actually has hundreds
modelFeatures = ['f1', 'f2', 'f3']

coreFeatures = list(modelFeatures) # cloning 


selectedFeatures = list(modelFeatures) # cloning

newFeatures = list(selectedFeatures) # cloning

finalFeatures = list(selectedFeatures) # cloning

# Now create the column list depending on the number of rows I will grab from
for x in range(2,model+1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures

# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget) 

# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)

for id_group in csv_by_id:
    #id_group is a tuple with first element as the id itself and second one a dataframe with the rows of a group
    group_data = id_group[1] 

    #hmm - can this be better? I am picking up the rows which I need from first row on wards
    df = group_data[coreFeatures][0:model] 

    # initialize a list
    tmp = [] 

    # now keep adding the column values into the list
    for index, row in df.iterrows(): 
        tmp = tmp + list(row)


    # Wow, this one below surely should have something better. 
    # So i am picking up the feature column values from the last row of the group of rows for a particular id 
    targetValues = group_data[coreFeatures][len(group_data.index)-1:len(group_data.index)].values 

    # Think this can be done easier too ? . Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten()) 

    # coverting the list to a dict.
    tmpDict = dict(zip(selectedFeatures,tmp))  

    # then the dict to a dataframe.
    tmpDf = pd.DataFrame(tmpDict,index={1}) 

    # I just could not find a better way of adding a dict or list directly into a dataframe. 
    # And I went through lots and lots of blogs on this topic, including some in StackOverflow.

    # finally I add the frame to my main frame
    model_data = model_data.append(tmpDf) 

# and write it
model_data.to_csv(wd+'model_data' + str(model) + '.csv',index=False)

Groupby (http://pandas.pydata.org/pandas-docs/stable/groupby.html) is your friend.

This will scale very well; there is only a small constant in the number of features. It will be roughly O(number of groups).
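
As a side note on the loop in the question: building a one-row DataFrame per group and appending it to a growing frame copies the accumulated data on every iteration, and `DataFrame.append` was removed in pandas 2.0. The sketch below keeps the question's loop structure but collects plain lists and builds the frame once at the end. The toy data is made up for illustration, and it assumes every id has at least `model` rows.

```python
import pandas as pd

features = ["f1", "f2", "f3"]
model = 3  # number of leading rows to grab per id, as in the question

# Hypothetical toy data in the shape of the question's CSV
train = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 5, 8, 6],
    "f2": [5, 1, 4, 3, 4, 6, 0, 1, 2],
    "f3": [5, 0, 4, 1, 6, 0, 3, 3, 7],
})

# f1 f2 f3, f1_n f2_n f3_n, f1_n_n ... followed by f1_t f2_t f3_t
columns = [f + "_n" * k for k in range(model) for f in features]
columns += [f + "_t" for f in features]

rows = []
for _, group in train.groupby("id"):
    values = group[features].to_numpy()
    # first `model` rows flattened, then the last row of the group
    rows.append(list(values[:model].ravel()) + list(values[-1]))

# Build the DataFrame once instead of appending inside the loop
model_data = pd.DataFrame(rows, columns=columns)
```

This still iterates over every group in Python, so it only removes the repeated copying; the groupby-based approach that follows avoids the per-group loop as well.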

In [28]: features = ['f1','f2','f3']

Create some test data: 70k groups, with group sizes between 7 and 11 (the upper bound of np.random.randint is exclusive)

In [29]: def create_df(i):
   ....:     l = np.random.randint(7,12)
   ....:     df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
   ....:     df['A'] = i
   ....:     return df
   ....: 

In [30]: df = concat([ create_df(i) for i in xrange(70000) ])

In [39]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629885 entries, 0 to 9
Data columns (total 4 columns):
f1    629885 non-null int64
f2    629885 non-null int64
f3    629885 non-null int64
A     629885 non-null int64
dtypes: int64(4)

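Create a frame where you select the first 3 rows and the final row of each group (note that this will handle groups with fewer than 4 rows, but the final row may then overlap one of the first three; you may wish to use groupby.filter to deal with this)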

In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()

# This step is necessary in pandas < master/0.14, as the returned fields
# will include the grouping field (the A); this is a bug/API issue
In [33]: groups = groups[features]

In [34]: groups.head(20)
Out[34]: 
     f1  f2  f3
A              
0 0   0   0   0
  1   1   1   1
  2   2   2   2
  7   7   7   7
1 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9
2 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
3 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
4 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9

[20 rows x 3 columns]

In [38]: groups.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 280000 entries, (0, 0) to (69999, 9)
Data columns (total 3 columns):
f1    280000 non-null int64
f2    280000 non-null int64
f3    280000 non-null int64
dtypes: int64(3)
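
The remark about small groups can be illustrated with `groupby.filter`, which keeps only the groups for which a function returns True. This is a minimal sketch on made-up data; the threshold of 4 rows (three leading rows plus a distinct final row) and the name `min_rows` are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data: group 1 has 5 rows, group 2 has only 2 rows
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 9],
})

# Keep only groups large enough that head(3) and tail(1) cannot overlap
min_rows = 4
big_enough = df.groupby("A").filter(lambda g: len(g) >= min_rows)
```

Running the head/tail selection on `big_enough` instead of `df` would then give exactly four distinct rows per remaining group.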

And it's pretty fast

In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
1 loops, best of 3: 1.16 s per loop

For further manipulation you should generally stop here and use this (it is in a nicely grouped format that is easy to work with).

If you want to translate this to a wide format:

In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))

In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
1 loops, best of 3: 14.5 s per loop

In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]

In [41]: dfg.head()
Out[41]: 
   f1_1  f2_1  f3_1  f1_2  f2_2  f3_2  f1_3  f2_3  f3_3  f1_4  f2_4  f3_4
A                                                                        
0     0     0     0     1     1     1     2     2     2     7     7     7
1     0     0     0     1     1     1     2     2     2     9     9     9
2     0     0     0     1     1     1     2     2     2     8     8     8
3     0     0     0     1     1     1     2     2     2     8     8     8
4     0     0     0     1     1     1     2     2     2     9     9     9

[5 rows x 12 columns]

In [42]: dfg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 69999
Data columns (total 12 columns):
f1_1    70000 non-null int64
f2_1    70000 non-null int64
f3_1    70000 non-null int64
f1_2    70000 non-null int64
f2_2    70000 non-null int64
f3_2    70000 non-null int64
f1_3    70000 non-null int64
f2_3    70000 non-null int64
f3_3    70000 non-null int64
f1_4    70000 non-null int64
f2_4    70000 non-null int64
f3_4    70000 non-null int64
dtypes: int64(12)
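
The session above comes from an old pandas version, where `head` and `tail` on a groupby returned a MultiIndex; in recent versions they return rows with their original index, so `groupby(level=0)` no longer applies directly. Below is a sketch of the same idea for recent pandas that avoids the per-group `apply` by reshaping the underlying NumPy array. The toy data, the variable names, and the `_n` / `_t` naming (taken from the question) are assumptions of this example, not part of the original answer.

```python
import numpy as np
import pandas as pd

features = ["f1", "f2", "f3"]
n_first = 3  # the question's `model`: leading rows to grab per id

# Hypothetical toy data in the shape of the question's CSV
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 5, 8, 6],
    "f2": [5, 1, 4, 3, 4, 6, 0, 1, 2],
    "f3": [5, 0, 4, 1, 6, 0, 3, 3, 7],
})

# Drop groups too small to give n_first rows plus a separate last row
df = df.groupby("id").filter(lambda g: len(g) > n_first)

g = df.groupby("id", sort=False)
# A stable sort keeps the within-group row order while making groups contiguous
first = g.head(n_first).sort_values("id", kind="stable")
last = g.tail(1).sort_values("id", kind="stable")

# Every group contributes exactly n_first rows, so one reshape flattens them
first_block = first[features].to_numpy().reshape(-1, n_first * len(features))
last_block = last[features].to_numpy()

columns = [f + "_n" * k for k in range(n_first) for f in features]
columns += [f + "_t" for f in features]

wide = pd.DataFrame(
    np.hstack([first_block, last_block]),
    index=pd.Index(last["id"].to_numpy(), name="id"),
    columns=columns,
)
```

The reshape relies on each remaining group having exactly `n_first` selected rows, which is why the filter step comes first. Writing the result with `wide.to_csv(..., index=False)` would give the layout asked for in the question.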
