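Python newbie here.
Imagine a csv file that looks something like this:
(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)
What I want to do is randomly flag 10% of each Person's rows and mark this in a new column. I came up with a ridiculously convoluted way to do this: it involved creating a helper column of random numbers plus all sorts of unnecessarily complicated extra helper columns. It worked, but it was crazy. More recently, I came up with this: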
import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''

names= list(df['Person'].unique()) #gets list of unique names

for name in names:
    df_temp = df[df['Person']== name]
    samp = int(len(df_temp)/10) # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!' #a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how = 'left', on = ['Person','data'])
    df['temp'] =[f"{a} {b}" for a,b in zip(df['selected_x'],df['selected_y'])]
    #Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    #df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person','data','temp']]
    df = df.rename(columns = {'temp':'selected'})

df['selected'] = df['selected'].str.replace('nan','').str.strip() #cleans up the column
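As you can see, essentially I'm pulling out a temporary DataFrame for each Person, using DF.sample(number) to do the randomizing, then using DF.merge to get the "marked" rows back into the original DataFrame. It involves iterating through a list to create each temporary DataFrame... and my understanding is that iterating is kind of lame.
There's got to be a more Pythonic, vectorized way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice much appreciated.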
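For reference, a minimal sketch of what a groupby-based version might look like. It assumes pandas 1.1 or later (where the groupby object has its own sample method) and a unique index; the toy DataFrame is a made-up stand-in for source.csv, and random_state is only there to make the run repeatable.

```python
import pandas as pd

# Hypothetical stand-in for source.csv: 3 people, 40 rows each.
df = pd.DataFrame({
    "Person": ["Ann"] * 40 + ["Ben"] * 40 + ["Cat"] * 40,
    "data": range(120),
})

# Draw 10% of each person's rows. The sampled rows keep their original
# index labels, so the index alone identifies which rows were picked.
picked = df.groupby("Person").sample(frac=0.1, random_state=0).index

# Flag the picked rows in a new column, with no loop and no merge.
df["selected"] = ""
df.loc[picked, "selected"] = "bingo!"

# Quick check: number of flagged rows per person.
print(df[df["selected"] == "bingo!"].groupby("Person").size())
```

Because the flag is written back through the index, this only works as intended when index labels are unique (a fresh read_csv gives that by default). The per-person count may also differ by one from int(len(df_temp)/10) when a group's size is not a multiple of 10.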
EDIT: here's another way to do it that avoids merge... but it's still pretty clunky:
import math
import numpy as np
import pandas as pd

#SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #shuffle (optional--just to show order doesn't matter)
percent = 10 #CHANGE AS NEEDED

#Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1) #this shuffles data, just to show order doesn't matter

#CREATE A HELPER LIST of [name, row count] pairs
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons','rand']]
    lim = math.ceil(len(df_temp) * percent*0.01)
    #append the cutoff: the lim-th largest random number for this person
    row.append(df_temp.nlargest(lim,'rand').iloc[-1]['rand'])

def flag(name,num):
    #'yes' if this row's random number is at or above its person's cutoff
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'

df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
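The helper list and the row-by-row flag() call above could probably be collapsed into groupby operations. A rough sketch of the same idea: rank the random numbers within each person and flag the top percent. The smaller test data and the seeded generator are assumptions made only so the example is quick and repeatable.

```python
import numpy as np
import pandas as pd

# Small hypothetical test data in the same shape as above.
df = pd.DataFrame({
    "persons": ["Alex"] * 23 + ["Doug"] * 41 + ["Chuck"] * 10,
    "data": "xyz",
})
percent = 10

# Helper column of random numbers (seeded so the run is repeatable).
rng = np.random.default_rng(0)
df["rand"] = rng.random(len(df))

grouped = df.groupby("persons")["rand"]

# Rank of each random number within its person; 1 = largest.
rank = grouped.rank(method="first", ascending=False)

# Rows to flag per person: ceil(percent% of the group size),
# broadcast back onto every row of that group.
limit = np.ceil(grouped.transform("count") * percent / 100)

df["flag"] = np.where(rank <= limit, "yes", "no")
```

This keeps the rounding-up behaviour of math.ceil from the loop version, so a person with 23 rows gets 3 flagged rows rather than 2.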