Create a Pandas DataFrame from a txt file with a specific pattern

2024-04-15

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]

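The rows with "[edit]" are states, and the rows with [number] are regions. I need to split the content as shown below and repeat the state name for each region name.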

Index          State          Region Name
0              Alabama        Auburn...
1              Alabama        Florence...
2              Alabama        Jacksonville...
...
8              Alaska         Fairbanks...
9              Arizona        Flagstaff...
10             Arizona        Tempe...

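I am not sure how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the state name for each region name. Could anyone give me a starting point for this?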


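You can first use read_csv http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html with the names parameter to create a DataFrame with a single column Region Name. The separator must be a character that does not occur in the values (for example ;):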

import pandas as pd

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
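
To see why the separator matters, here is a minimal sketch that reads a few of the question's lines from memory. The io.StringIO buffer and the names sample and raw are stand-ins for illustration, not part of the original answer. Because ; never appears in the data, each line should land intact in the single Region Name column, including the Tuscaloosa line with its commas.

```python
import io

import pandas as pd

# A few lines from the question, held in memory instead of 'filename.txt'.
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
"""

# ';' does not occur in the data, so no line is split into several fields.
raw = pd.read_csv(io.StringIO(sample), sep=";", names=["Region Name"])
print(raw.shape)  # (5, 1)
```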

Then insert http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html a new column State, built with extract http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html from the rows that contain [edit] and forward-filled so that every region row carries its state. After that, replace http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html everything from ( to the end of the value in column Region Name:

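# Capture the text before '[edit]' (NaN on other rows), then forward-fill it.
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required in pandas 2.0+, where str.replace is literal by default.
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)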
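
The extract and ffill combination is the core of this step, so a small sketch on a standalone Series may help. The names lines, extracted and states are chosen here for illustration only. Only the [edit] rows match the pattern, so the other rows come back as NaN, and the forward fill then copies each state name down to the rows below it.

```python
import pandas as pd

lines = pd.Series([
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
])

# Only rows ending in '[edit]' match; every other row becomes NaN.
extracted = lines.str.extract(r"(.*)\[edit\]", expand=False)

# Forward fill propagates the last seen state name to the following rows.
states = extracted.ffill()
print(states.tolist())
# ['Alabama', 'Alabama', 'Alabama', 'Alaska', 'Alaska']
```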

Finally, remove the rows that contain [edit] by boolean indexing http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing, with the mask created by str.contains http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
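
For reuse, the three steps can be wrapped in one function. This is only a sketch: parse_regions is a hypothetical helper name, and the in-memory sample stands in for the real file. It accepts anything read_csv accepts (a path or a file-like object). A state with no regions, such as the trailing Arkansas[edit] line, simply contributes no rows.

```python
import io

import pandas as pd


def parse_regions(source):
    """Hypothetical helper: return a State / Region Name DataFrame."""
    df = pd.read_csv(source, sep=";", names=["Region Name"])
    # State names appear only on '[edit]' rows; other rows are NaN here.
    state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False)
    df.insert(0, "State", state.ffill())
    # Drop everything from ' (' to the end of each region row.
    df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)
    # Keep only the region rows, i.e. those that did not match '[edit]'.
    return df[state.isna()].reset_index(drop=True)


sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
    "Arkansas[edit]\n"
)
result = parse_regions(sample)
print(result)
```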

If you need to keep the full values, the solution is simpler:

import pandas as pd

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)
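
A possible middle ground, not part of the original answer, is to keep the university names but drop only the footnote markers. The sketch below assumes every marker has the form [digits]. The names regions and cleaned are illustrative.

```python
import pandas as pd

regions = pd.Series([
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]",
])

# Remove each '[digits]' marker; text inside the parentheses is left alone.
cleaned = regions.str.replace(r"\[\d+\]", "", regex=True).str.strip()
print(cleaned.tolist())
```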