我目前正在使用内部联接合并两个数据帧。但是,合并后,即使我合并的列包含相同的值,我也会看到所有行都是重复的。
具体来说,我有以下代码。
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
这是两个数据框和结果。
df1
email_address name surname
0 [email protected] /cdn-cgi/l/email-protection john smith
1 [email protected] /cdn-cgi/l/email-protection john smith
2 [email protected] /cdn-cgi/l/email-protection elvis presley
df2
email_address street city
0 [email protected] /cdn-cgi/l/email-protection street1 NY
1 [email protected] /cdn-cgi/l/email-protection street1 NY
2 [email protected] /cdn-cgi/l/email-protection street2 LA
merged_df
email_address name surname street city
0 [email protected] /cdn-cgi/l/email-protection john smith street1 NY
1 [email protected] /cdn-cgi/l/email-protection john smith street1 NY
2 [email protected] /cdn-cgi/l/email-protection john smith street1 NY
3 [email protected] /cdn-cgi/l/email-protection john smith street1 NY
4 [email protected] /cdn-cgi/l/email-protection elvis presley street2 LA
5 [email protected] /cdn-cgi/l/email-protection elvis presley street2 LA
我的问题是,不应该是这样吗?
这就是我想要的merged_df
像是。
email_address name surname street city
0 [email protected] /cdn-cgi/l/email-protection john smith street1 NY
1 [email protected] /cdn-cgi/l/email-protection john smith street1 NY
2 [email protected] /cdn-cgi/l/email-protection elvis presley street2 LA
有什么方法可以实现这一目标吗?
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])
重复的行是预期的。每个约翰·史密斯list_1
与每个约翰·史密斯匹配list_2
。我不得不将重复项放入其中一个列表中。我选择了list_2
.
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)