psycopg2 equivalent of R's dbWriteTable command, and getting more performance out of Python code

2024-03-09

I just ran a timing test that loads data arrays from CSV files, creates a database on Postgres, and writes the tables to it, in both Python and R.

To my surprise, the times were very similar:

First the Python code (for example):

import timeit
tic = timeit.default_timer()
tic4 = timeit.default_timer()
import xlrd as xl
import psycopg2 as pq
import os
import pandas as pd 
import numpy as np
import csv
from pprint import pprint as pp

perf_dir = '/myhomedir'
toc4=timeit.default_timer()

# Create the database
tic1= timeit.default_timer()
os.system('dropdb ptest')
os.system('createdb ptest')
# connect to the database
cn = pq.connect('dbname=ptest user=me')
cr = cn.cursor()
toc1=timeit.default_timer()

# Create the tables:
### load csvs
tic2=timeit.default_timer()
id_files = ('di1','di2','di.c')
id_files = [i+r'.csv' for i in id_files]
id1 = csv.reader(open(os.path.join(perf_dir,id_files[0])),delimiter='\t')
id1 = [i for i in id1]

id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
id2 = [i for i in id2]

id3 = csv.reader(open(os.path.join(perf_dir,id_files[2])),delimiter='\t')
id3 = [i for i in id3]
id3 = [i[1:3] for i in id3]
toc2=timeit.default_timer()

# create tables and fill
### id1 fund classifications
tic3=timeit.default_timer()
cr.execute('CREATE TABLE id1 (%s varchar, %s int PRIMARY KEY, %s int, %s int, %s varchar)' % tuple(id1[0]))
FLDS = 'INSERT INTO id1 (%s,%s,%s,%s,%s) VALUES' % tuple(id1[0]) 
SQL = FLDS + ' (%s,%s,%s,%s,%s)'
for i in range(1,len(id1)):
    data = tuple(id1[i])
    cr.execute(SQL,data)

### id2 portfolio group classifications - reference only
cr.execute('CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % tuple(id2[0]))
SQL = 'INSERT INTO id2 (%s,%s,%s) VALUES' % tuple(id2[0]) + ' (%s,%s,%s)'
for i in range(1,len(id2)):
    data = tuple(id2[i])
    cr.execute(SQL,data)

### id3 value variable classifications
cr.execute('CREATE TABLE id3 (%s varchar,%s varchar)' % tuple(id3[0]))
SQL = 'INSERT INTO id3 VALUES(%s,%s)'
for i in range(1,len(id3)):
    data = tuple(id3[i])
    cr.execute(SQL,data)

cn.commit()

# Timing block - will be commented out in final code
toc3=timeit.default_timer()
toc = timeit.default_timer()
time = (toc - tic)
time1 = toc1 - tic1
time2 = toc2 - tic2
time3 = toc3 - tic3
time4 = toc4 - tic4
print('Overall time: %s' % time)
print('dB create & connect time: %s' % time1)
print('Load id csvs time: %s' % time2)
print('Create tables and write to db time: %s' % time3)
print('Time to import libraries: %s' % time4)

And the R code (which is much cleaner):

tic = proc.time()
library(RPostgreSQL)

tic1 = proc.time()
system('dropdb ptest1')
system('createdb ptest1')
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname='ptest1')
toc1 = proc.time()
time1 = toc1 - tic1

tic2 = proc.time()
id.1 = read.csv('/myhomedir/di1.csv',stringsAsFactors=F,sep='\t')
id.2 = read.csv('/myhomedir/di2.csv',stringsAsFactors=F)
id.3 = read.csv('/myhomedir/di.c.csv',stringsAsFactors=F,sep='\t')
id.3 = id.3[,-1]
toc2 = proc.time()
time2 = toc2 - tic2

tic3 = proc.time()
dbWriteTable(con,'id1',id.1)
dbWriteTable(con,'id2',id.2)
dbWriteTable(con,'id3',id.3)
toc3 = proc.time()
time3 = toc3 - tic3

toc = proc.time()
time = toc - tic
tyme = rbind(time1,time2,time3,time)
tyme = data.frame(Function=c('Create & Connect to DB',"Load CSV's for save","Write Table to DB",'Overall Time'),tyme)

I was surprised the two times were so close. (I'd read a lot about R being slow and Python being very fast.)

For Python:

Overall time: 2.48381304741
dB create & connect time: 1.96832108498
Load id csvs time: 0.000378847122192
Create tables and write to db time: 0.35303401947
Time to import libraries: 0.162075042725

For R:

                    Function user.self sys.self elapsed user.child sys.child
time1 Create & Connect to DB     0.112    0.016   1.943       0.06     0.004
time2    Load CSV's for save     0.008    0.000   0.006       0.00     0.000
time3      Write Table to DB     0.096    0.004   0.349       0.00     0.000
time            Overall Time     0.376    0.028   2.463       0.06     0.004

I wondered whether this had to do with the fact that the Python version INSERTs rows into the tables one at a time.

So the main question is: is there an equivalent in Python of the dbWriteTable block in the R code, and would it speed things up?

A secondary question is whether there are any obvious mistakes in the code that would slow it down.

Happy to provide sample CSVs if that would help.

I'm not trying to start an R-versus-Python flame war; I'd just like to know how to make my code faster.

Thanks


Try this:

    id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
    h = tuple(next(id2))

    create = '''CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)''' % h
    insert = '''INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)''' % h

    ...

    cr.executemany(insert, id2)

Pass the SQL command only once for all the rows. Besides the performance overhead of issuing extra SQL commands, notice that the size of the SQL strings themselves (in bytes) probably dwarfs the size of the real data in your test.
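The executemany pattern can be sketched end to end with the standard library's sqlite3 module, which shares the DB-API cursor interface with psycopg2 (note that sqlite3 uses `?` placeholders where psycopg2 uses `%s`; the table and rows here are made-up stand-ins for your data):

```python
import sqlite3

# An in-memory database stands in for Postgres; cursor.execute and
# cursor.executemany have the same shape as in psycopg2.
cn = sqlite3.connect(":memory:")
cr = cn.cursor()
cr.execute("CREATE TABLE id2 (name TEXT, code INTEGER PRIMARY KEY, grp INTEGER)")

rows = [("alpha", 1, 10), ("beta", 2, 10), ("gamma", 3, 20)]

# One SQL string, prepared once, applied to every row.
cr.executemany("INSERT INTO id2 (name, code, grp) VALUES (?, ?, ?)", rows)
cn.commit()

cr.execute("SELECT COUNT(*) FROM id2")
print(cr.fetchone()[0])  # prints 3
```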

Using id2 as a generator, rather than materializing it into a list, should also avoid loading the whole dataset into memory.
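For what it's worth, the closest psycopg2 analogue to R's dbWriteTable is probably the COPY protocol via cursor.copy_from, which streams an entire table in a single command. A minimal sketch (the helper below is hypothetical, and the commented copy_from call assumes a live connection like cr above):

```python
import csv
import io

def rows_to_copy_buffer(rows, sep="\t"):
    """Serialize rows into a tab-delimited in-memory file suitable for COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    return buf

# With a live psycopg2 cursor (assumed, not shown here), the whole table
# loads in one COPY command instead of one INSERT per row:
#   buf = rows_to_copy_buffer(id2[1:])   # skip the header row
#   cr.copy_from(buf, 'id2', sep='\t')
#   cn.commit()
```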
