How can I speed up a Python loop?

2024-02-08

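I have a dataset of 370k records, stored in a Pandas DataFrame, that I need to process. I tried multiprocessing, threading, Cython, and loop unrolling, but none of them worked, and the estimated computation time shown was 22 hours. The task is as follows: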

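%matplotlib inline
# autojit is no longer available in current Numba releases, so only jit is imported
from numba import jit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
from tqdm import tqdm

# Read the tab-separated file line by line
with open('data/full_text.txt', encoding="ISO-8859-1") as f:
    strdata = f.readlines()

# Split each line into its fields, dropping the trailing newline
data = []
for string in strdata:
    data.append(string.rstrip('\n').split('\t'))

df = pd.DataFrame(data, columns=["uname", "date", "UT", "lat", "long", "msg"])

df = df.drop('UT', axis=1)

df[['lat', 'long']] = df[['lat', 'long']].apply(pd.to_numeric)

# Placeholder column, filled in by the attempts below
df['polarity'] = np.zeros(len(df))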

Threading:

from queue import Queue
from threading import Thread
from time import time
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='(%(threadName)-10s) %(message)s',
)


class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            lowIndex, highIndex = self.queue.get()
            # Cap the upper bound so the last chunk stays inside the DataFrame
            for i in range(lowIndex, min(highIndex, len(df))):
                df.at[i, 'polarity'] = TextBlob(df.at[i, 'msg']).sentiment.polarity
            self.queue.task_done()


def main():
    ts = time()
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for i in tqdm(range(0, len(df), 62936)):
        logging.debug('Queueing')
        queue.put((i, i + 62936))
    # Wait until every queued chunk has been processed
    queue.join()
    print('Took {}'.format(time() - ts))


main()

Multiprocessing with loop unrolling:

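import multiprocessing

def assign_polarity(chunk):
    # Each worker gets its own copy of the rows, re-indexed from 0
    chunk = chunk.reset_index(drop=True)
    n = len(chunk) - len(chunk) % 5
    # Loop unrolled by hand: five rows per pass
    for i in tqdm(range(0, n, 5)):
        chunk.at[i, 'polarity'] = TextBlob(chunk.at[i, 'msg']).sentiment.polarity
        chunk.at[i+1, 'polarity'] = TextBlob(chunk.at[i+1, 'msg']).sentiment.polarity
        chunk.at[i+2, 'polarity'] = TextBlob(chunk.at[i+2, 'msg']).sentiment.polarity
        chunk.at[i+3, 'polarity'] = TextBlob(chunk.at[i+3, 'msg']).sentiment.polarity
        chunk.at[i+4, 'polarity'] = TextBlob(chunk.at[i+4, 'msg']).sentiment.polarity
    # Remaining rows that do not fill a group of five
    for i in range(n, len(chunk)):
        chunk.at[i, 'polarity'] = TextBlob(chunk.at[i, 'msg']).sentiment.polarity
    return chunk

if __name__ == '__main__':  # guard needed on Windows, where workers are spawned
    half = len(df) // 2
    with multiprocessing.Pool(processes=2) as pool:
        r = pool.map(assign_polarity, [df.iloc[:half], df.iloc[half:]])
    # Workers cannot modify the parent's df, so reassemble the returned pieces
    df = pd.concat(r, ignore_index=True)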

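How can I increase the computation speed, or store the computed values in the DataFrame in a faster way? My laptop configuration:

  • RAM: 8 GB
  • Physical cores: 2
  • Logical cores: 8
  • Windows 10

Implementing multiprocessing gave me a longer computation time. The threads executed sequentially (I think because of the GIL). Loop unrolling gave me the same computation speed. Cython gave me errors while importing libraries.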


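ASD: I noticed that storing values in the df one iteration at a time is very slow. I would try collecting the TextBlob results in a list (or another structure) and then converting that list into a column of the df.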
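A minimal sketch of that idea, under some assumptions: `toy_polarity` and its word sets are hypothetical stand-ins for `TextBlob(text).sentiment.polarity` so the snippet runs without TextBlob installed, and the three-row DataFrame is made-up sample data, not the asker's file. The point is only the shape of the code: compute every value into a plain list, then assign the column once.

```python
import pandas as pd

# Hypothetical stand-in for TextBlob(text).sentiment.polarity.
# It is NOT a real sentiment model; it only lets the sketch run on its own.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def toy_polarity(text):
    """Return (positive words - negative words) / total words, or 0.0 if empty."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / len(words)


# Made-up sample data standing in for the 'msg' column of the question
df = pd.DataFrame({"msg": ["I love this", "awful bad day", "nothing here"]})

# Build the whole result as a plain Python list first ...
polarities = [toy_polarity(msg) for msg in df["msg"]]

# ... then attach it to the DataFrame in a single assignment,
# instead of writing df['polarity'][i] once per row.
df["polarity"] = polarities

print(df)
```

With TextBlob available, the list comprehension would presumably become `[TextBlob(msg).sentiment.polarity for msg in df["msg"]]`. This removes the per-row DataFrame writes, but the TextBlob call itself still runs once per message, so how much time it saves on 370k rows would need to be measured.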
