[Data Processing] A Cheat Sheet of Common Python Operations

2023-11-16

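Common Python operations for data analysis

This is a collection of commands that I can never remember and keep looking up while doing data analysis. I am summarizing them here and will update the list from time to time.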

DataFrame

import pandas as pd

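Concatenating DataFrames

Need: sometimes several datasets with identical columns (for example, someone else's training and test sets) have to be combined before analysis.
Code: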

pd.concat([df1, df2])

Reference: https://www.cnblogs.com/guxh/p/9451532.html
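As a minimal sketch of the call above, the example below stacks two small frames with identical columns. The frames `df1` and `df2` and their contents are made up for illustration; `ignore_index=True` is an optional argument that renumbers the rows so the result does not carry duplicate index labels.

```python
import pandas as pd

# Hypothetical train/test frames with identical columns.
df1 = pd.DataFrame({"user": [1, 2], "item": [10, 20]})
df2 = pd.DataFrame({"user": [3], "item": [30]})

# Stack the rows; ignore_index=True renumbers the index from 0.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the result would keep the original labels (0, 1, 0), which is usually when a later index reset becomes necessary.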

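Resetting the DataFrame index

Need: a DataFrame that has been rearranged needs its index reset, typically after a selection or sort operation.
Code: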

train = train.reset_index(drop=True)

Reference: https://blog.csdn.net/qq_36523839/article/details/80640139
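To make the effect visible, here is a small sketch with invented data: sorting shuffles the index labels, and `reset_index(drop=True)` restores a clean 0..n-1 index without keeping the old labels as a column.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration.
train = pd.DataFrame({"user": [3, 1, 2], "score": [0.5, 0.9, 0.1]})

sorted_train = train.sort_values(by="user")
index_after_sort = list(sorted_train.index)          # labels follow the rows

sorted_train = sorted_train.reset_index(drop=True)   # fresh 0..n-1 index
print(index_after_sort, list(sorted_train.index))
```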

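Selecting the first few columns of a DataFrame

Need: a step of the analysis only works on some columns of the DataFrame.
Constraint: the column names are unknown.
Code: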

sub_train = train.iloc[:,0:2]
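A short sketch of positional selection on a made-up frame follows. The point to remember is that the end of an `iloc` slice is exclusive, so `0:2` picks the columns at positions 0 and 1.

```python
import pandas as pd

# Hypothetical frame; the column names are only used to check the result.
train = pd.DataFrame({"user": [1, 2], "item": [10, 20], "score": [4.0, 5.0]})

sub_train = train.iloc[:, 0:2]   # all rows, first two columns by position
last_col = train.iloc[:, -1]     # all rows, last column (returned as a Series)
print(list(sub_train.columns), last_col.name)
```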

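Inserting a column into a DataFrame

Need: derived data produced during analysis has to be stored in the DataFrame.
Constraint: the position of the new column must be specified.
Code:

df.insert(1, 'd', np.ones(len(df)))  # needs "import numpy as np"; inserts column 'd' at position 1, in place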

Reference: https://blog.csdn.net/brucewong0516/article/details/82493080
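The following sketch, on an invented four-row frame, shows where the new column lands. Note that `DataFrame.insert` modifies the frame in place and returns `None`, so its result should not be assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with four rows.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# Insert column 'd' at position 1; the values must match the row count.
result = df.insert(1, "d", np.ones(len(df)))
print(list(df.columns), result)
```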

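Displaying a DataFrame in full

Need: some rows or columns are too long to be displayed in full, but I want to see all of them when previewing.
Code:

# show all columns
pd.set_option('display.max_columns', None)
# show all rows
pd.set_option('display.max_rows', None)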

Reference: https://blog.csdn.net/qq_34862636/article/details/102581675
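If the full display is only wanted for a single preview, a possible alternative is `pd.option_context`, which applies the options inside a `with` block and restores the previous values afterwards. The frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns.
df = pd.DataFrame({f"c{i}": range(3) for i in range(30)})

default_cols = pd.get_option("display.max_columns")
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    inside = pd.get_option("display.max_columns")   # None means "no limit"
    print(df)
after = pd.get_option("display.max_columns")        # back to the earlier value
```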

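Getting the unique values of a DataFrame column and plotting their distribution

Need: numpy's unique would normally do the job, but when the column holds strings rather than numbers, the DataFrame operations are more convenient.
Code:

import matplotlib.pyplot as plt

data = pd.read_csv('event.csv', dtype='str', header=0)
city_set = data['city'].value_counts()  # unique cities with their counts, most frequent first
city_set.head(20).plot(kind='bar', title='Events in different city')  # top 20 cities
plt.show()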


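Database-style queries on a DataFrame

Need: a DataFrame is essentially a table, much like one in a relational database, and sometimes the data calls for fairly complex queries.

Conditional queries (Where)

choose_data = data[data['col'] == val]  # one table, one condition
choose_data = data[(data['col1'] == val1) & (data['col2'] == val2)]  # one table, multiple conditions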

Reference: https://bbs.pinggu.org/thread-4608666-1-1.html
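A small sketch of combining conditions, on invented data: each condition needs its own parentheses because `&` and `|` bind more tightly than `==`. `DataFrame.query` is shown as one possible SQL-like spelling of the same filter.

```python
import pandas as pd

# Hypothetical event table.
data = pd.DataFrame({
    "city": ["A", "B", "A", "C"],
    "year": [2010, 2011, 2012, 2012],
})

both = data[(data["city"] == "A") & (data["year"] >= 2011)]    # AND
either = data[(data["city"] == "C") | (data["year"] == 2010)]  # OR
same = data.query("city == 'A' and year >= 2011")              # same filter as `both`
print(both, either, sep="\n")
```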

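Checking membership in a set (IN)

valid_year = np.arange(2010, 2019, 1)  # condition: the years 2010 to 2018 (the end value is exclusive)
idx = data[data.year.isin(valid_year)].index  # index labels of the matching rows
valid_data = data.loc[idx, :]  # select by label; iloc would treat the labels as positions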

Reference: https://www.cnblogs.com/shadow1/p/10700264.html
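The membership test can also be used directly as a boolean mask, which skips the detour through index labels. The sketch below uses an invented frame whose index deliberately does not start at 0; negating the mask with `~` gives the equivalent of NOT IN.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index.
data = pd.DataFrame({"year": [2005, 2012, 2018, 2019]}, index=[10, 11, 12, 13])

valid_year = np.arange(2010, 2019)      # 2010..2018; 2019 is not included
mask = data["year"].isin(valid_year)

valid_data = data[mask]                 # IN
excluded = data[~mask]                  # NOT IN
print(valid_data, excluded, sep="\n")
```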

Joining DataFrames (join)

data = data.set_index('event', drop=True)  # original columns = ['user','event']
data2 = data2.set_index('event', drop=True)  # original columns = ['event','venue','time','group']
d = data.join(data2, how='left')  # both frames are indexed by 'event'; columns = ['user','venue','time','group']
d = d.reset_index(drop=False)  # turn 'event' back into a regular column
d = d[['user','event','time','venue','group']]  # columns = ['user','event','time','venue','group']

Reference: https://blog.csdn.net/claroja/article/details/71023167
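As a possible alternative, `DataFrame.merge` performs the same left join on a shared column without touching the index. The two frames below are invented stand-ins with the column layout described in the comments above.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables.
data = pd.DataFrame({"user": [1, 2, 3], "event": ["e1", "e2", "e1"]})
data2 = pd.DataFrame({
    "event": ["e1", "e2"],
    "venue": ["v1", "v2"],
    "time": ["t1", "t2"],
    "group": ["g1", "g2"],
})

# Left join on the shared 'event' column, then reorder the columns.
d = data.merge(data2, on="event", how="left")
d = d[["user", "event", "time", "venue", "group"]]
print(d)
```

A left join keeps every row of `data`, so users attending the same event each receive a copy of that event's details.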

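Sorting a DataFrame (Order by)

Need: to split a dataset by time, it has to be sorted first.
Constraint: sort by the specified column(s), in ascending order.
Code: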

data = data.sort_values(by=['user'],axis=0, ascending=True)

Reference: https://blog.csdn.net/sinat_22147265/article/details/81284688
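To connect sorting with the time-based split mentioned in the need, here is a rough sketch. The `time` column and the 80/20 ratio are assumptions chosen for illustration, not part of the original snippet.

```python
import pandas as pd

# Hypothetical interactions with a numeric timestamp column.
data = pd.DataFrame({"user": [1, 2, 3, 4, 5], "time": [50, 10, 40, 20, 30]})

# Sort chronologically, then cut: the earliest 80% train, the rest test.
data = data.sort_values(by="time", ascending=True).reset_index(drop=True)
cut = int(len(data) * 0.8)
train_part, test_part = data.iloc[:cut], data.iloc[cut:]
print(train_part, test_part, sep="\n")
```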

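Grouping and aggregating (Group by)

Need: the same as grouped aggregation in a database, using a count as the example.
Code:

sc = sub_train.groupby(['user','item']).count()  # non-null count of each remaining column per (user, item) group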

Reference: http://everyang.net/787/
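One detail worth a sketch is the difference between `count()` and `size()`: `count()` skips missing values, while `size()` counts rows. The data below is invented, with one missing score to make the difference show.

```python
import pandas as pd

# Hypothetical ratings; one score is missing.
sub_train = pd.DataFrame({
    "user": [1, 1, 1, 2],
    "item": [10, 10, 20, 10],
    "score": [5.0, None, 3.0, 4.0],
})

sizes = sub_train.groupby(["user", "item"]).size()             # rows per group
counts = sub_train.groupby(["user", "item"])["score"].count()  # non-null scores per group
flat = sizes.reset_index(name="n")                             # back to a flat table
print(flat)
```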

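Removing duplicates (Distinct)

Need: depending on the requirement, deduplicate either on a subset of columns (partial duplicates) or on the whole row (full duplicates).
Code:

testlist = list(test_data[test_data['user']==user]['item'].drop_duplicates())  # drop duplicates, keeping the first occurrence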

Reference: https://www.cnblogs.com/zenan/p/8616508.html
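The two kinds of deduplication named in the need can be sketched as follows, on invented data. The `subset` argument restricts the comparison to the listed columns, and `keep="first"` (the default) retains the first row of each duplicate group.

```python
import pandas as pd

# Hypothetical test set; rows 0 and 1 share (user, item) but differ in score.
test_data = pd.DataFrame({
    "user": [1, 1, 1, 2],
    "item": [10, 10, 20, 10],
    "score": [5, 4, 3, 5],
})

full = test_data.drop_duplicates()                                        # whole-row duplicates
partial = test_data.drop_duplicates(subset=["user", "item"], keep="first")  # by some columns
items = test_data.loc[test_data["user"] == 1, "item"].drop_duplicates().tolist()
print(len(full), len(partial), items)
```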

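Dropping empty rows

Need: remove rows with missing values, depending on the requirement.
Code:

df = df[~(df['col'].isnull())]  # drop rows where 'col' is null

df = df.dropna(axis=0)  # drop rows that contain any null value (axis=0)

df = df.dropna(axis=1)  # drop columns that contain any null value (axis=1)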

Reference: https://blog.csdn.net/weixin_45852947/article/details/119453881
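`dropna` also takes `subset` and `how` arguments, which cover the common variations. The sketch below uses an invented frame to contrast them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has one gap, row 1 is entirely empty.
df = pd.DataFrame({
    "col": [1.0, np.nan, 3.0],
    "other": [np.nan, np.nan, 6.0],
})

by_col = df.dropna(subset=["col"])   # only look at 'col'
all_empty = df.dropna(how="all")     # drop rows where every value is missing
any_empty = df.dropna()              # drop rows with at least one missing value
print(list(by_col.index), list(all_empty.index), list(any_empty.index))
```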

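Deriving a new column from existing columns

Need: compute a target value from several columns.
Code:

# '优惠金额' is the discount amount; '商户分摊成本1' is the cost shared by the merchant
real_cost = JHSH_data.apply(lambda x: float(x['优惠金额']) - float(x['商户分摊成本1']), axis = 1)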
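For larger tables, a vectorized version is usually faster than a row-wise `apply`. The sketch below assumes the values arrive as strings (as the `float(...)` calls above suggest) and uses made-up English column names in place of the original ones.

```python
import pandas as pd

# Hypothetical frame; 'discount' and 'merchant_cost' are placeholder names.
df = pd.DataFrame({"discount": ["10.5", "8"], "merchant_cost": ["2.5", "3"]})

# Convert whole columns to numbers once, then subtract column from column.
df["real_cost"] = pd.to_numeric(df["discount"]) - pd.to_numeric(df["merchant_cost"])
print(df)
```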

Saving data to different sheets of the same Excel file

Need: calling DataFrame.to_excel() repeatedly with a file path does not work, even when sheet_name is given: each write overwrites the data written before it. The fix is to go through pandas' ExcelWriter.
Code:

# Method 1 (recommended)
with pd.ExcelWriter('test.xlsx') as writer:
    data.to_excel(writer, sheet_name='data')
    data2.to_excel(writer, sheet_name='data2')

# Method 2
writer = pd.ExcelWriter('test.xlsx')
data.to_excel(writer, sheet_name='data')
data2.to_excel(writer, sheet_name='data2')
writer.close()  # close() writes the file; writer.save() was removed in pandas 2.0

Numpy

import numpy as np

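Random initialization with NumPy

Need: assign random initial values.
Code:

P = np.random.uniform(low=0,high=1,size=[N,d])  # N x d matrix with values in [0, 1)
userP[u]=np.random.normal(0,0.01,dimension)  # vector of length dimension, normal distribution with mean 0 and standard deviation 0.01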

Reference: https://www.cnblogs.com/JetReily/p/9398148.html
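When the random initial values need to be reproducible, one option is NumPy's `Generator` interface with a fixed seed. The sizes and the seed below are arbitrary values chosen for illustration.

```python
import numpy as np

N, d = 5, 3                               # hypothetical matrix dimensions
rng = np.random.default_rng(seed=42)      # seeded generator: same numbers every run

P = rng.uniform(low=0.0, high=1.0, size=(N, d))   # N x d matrix, values in [0, 1)
vec = rng.normal(loc=0.0, scale=0.01, size=d)     # mean 0, standard deviation 0.01

# A second generator with the same seed reproduces the first draw.
P_again = np.random.default_rng(seed=42).uniform(low=0.0, high=1.0, size=(N, d))
print(P.shape, vec.shape)
```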

Dictionary

Iterating over a dictionary

Need: a dictionary is a convenient structure for storing a sparse matrix and saves space.
Code:

for key,value in dicts.items():
    print('key is:',key,'value is',value)

Reference: https://blog.csdn.net/u010589524/article/details/86499394
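To illustrate the sparse-matrix use mentioned in the need, here is a minimal sketch that stores only the non-zero entries under `(row, col)` keys. The layout is one possible convention, not the only one.

```python
# Hypothetical sparse matrix: only non-zero cells are stored.
sparse = {(0, 2): 5.0, (3, 1): 2.5}

total = 0.0
for (row, col), value in sparse.items():   # tuple keys unpack in the loop header
    print("row", row, "col", col, "value", value)
    total += value

missing = sparse.get((1, 1), 0.0)          # absent cells read as zero
```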

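Mapping one key to multiple values

Need: interactions between columns are usually many-to-many, which a dictionary with several values per key can express.
Code:

from collections import defaultdict
rating = defaultdict(set)  # missing keys start out as an empty set
for i in range(0, len(train)):
    user = train.iloc[i]['user']
    item = train.iloc[i]['item']
    score = train.iloc[i]['score']  # read here but not used below
    rating[user].add(item)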
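The same mapping can be built with `itertuples`, which avoids one `iloc` lookup per field and tends to be faster on large frames. The frame below is invented, with the column names used in the loop above.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical training data.
train = pd.DataFrame({"user": [1, 1, 2], "item": [10, 20, 10], "score": [5, 3, 4]})

rating = defaultdict(set)
for row in train.itertuples(index=False):   # one namedtuple per row
    rating[row.user].add(row.item)

print(dict(rating))
```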

OS

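Checking whether a folder exists and creating it if not

Need: have the program create folders in a fixed layout so that data is archived neatly.
Code:

import os

if not os.path.exists('./myfile'):
    os.mkdir('./myfile')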

Reference: https://www.cnblogs.com/VseYoung/p/9941873.html
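A compact alternative is `os.makedirs` with `exist_ok=True`, which creates missing parent folders too and does nothing if the folder is already there. The sketch works inside a temporary directory so it leaves no trace; the nested folder names are made up.

```python
import os
import tempfile

base = tempfile.mkdtemp()                       # throwaway parent folder
target = os.path.join(base, "myfile", "2023")   # hypothetical nested layout

os.makedirs(target, exist_ok=True)   # creates 'myfile' and '2023'
os.makedirs(target, exist_ok=True)   # second call is a no-op, no exception
print(os.path.isdir(target))
```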

Time

Converting a timestamp to a date

import time
timeStamp = 1381419600
timeArray = time.localtime(timeStamp)
print(time.strftime("%Y/%m/%d %H:%M:%S", timeArray))

Reference: https://www.cnblogs.com/jfl-xx/p/8024596.html
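`time.localtime` depends on the machine's time zone, so the same timestamp prints differently on different machines. As a sketch of a zone-explicit variant, the `datetime` module can convert the same timestamp in UTC:

```python
from datetime import datetime, timezone

timestamp = 1381419600

# Interpret the timestamp in UTC instead of the local time zone.
dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
text = dt_utc.strftime("%Y/%m/%d %H:%M:%S")
print(text)
```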
…To be continued, with ongoing updates…

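Plotting

Saving a figure at high resolution without clipping

plt.savefig('Events in different city', dpi=300, bbox_inches='tight')  # plt is matplotlib.pyplot; bbox_inches='tight' keeps labels from being cut off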

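Utility functions

Unzip function

import zipfile

def unzip_file(zip_src, dst_dir):
    # Extract only if zip_src really is a zip archive
    if zipfile.is_zipfile(zip_src):
        with zipfile.ZipFile(zip_src, 'r') as fz:  # the archive is closed when the block ends
            for file in fz.namelist():
                fz.extract(file, dst_dir)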
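For a quick check of the idea, the sketch below builds a tiny archive in a temporary directory and unpacks it with `extractall`, which extracts every member in one call. The file names are invented for the round trip.

```python
import os
import tempfile
import zipfile

work = tempfile.mkdtemp()                  # throwaway working folder
zip_src = os.path.join(work, "demo.zip")   # hypothetical archive
dst_dir = os.path.join(work, "out")

# Create an archive with a single small text file.
with zipfile.ZipFile(zip_src, "w") as zf:
    zf.writestr("hello.txt", "hi")

# Unpack it again; the destination folder is created if needed.
if zipfile.is_zipfile(zip_src):
    with zipfile.ZipFile(zip_src, "r") as zf:
        zf.extractall(dst_dir)

with open(os.path.join(dst_dir, "hello.txt")) as fh:
    content = fh.read()
print(content)
```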
