I'm Nicola, a new Python user with no real background in computer programming, so I could really use some help with a problem I've run into. I wrote some code to scrape data from this web page:
Basically, the goal of my code is to scrape the data from all the tables on the page and write them to a txt file.
Here is my code:
#!/usr/bin/env python
# Python 2 script: the "print >> file" syntax and the BeautifulSoup 3
# import below are Python 2 only.
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

def extract(soup):
    # Tables 1-6 on the page hold the data; each has a different number
    # of data rows, so skip the header row and, where known, stop before
    # the trailing rows.
    slices = [(1, slice(1, 19)),
              (2, slice(1, 21)),
              (3, slice(1, 44)),
              (4, slice(1, 18)),
              (5, slice(1, None)),
              (6, slice(1, None))]
    for table_index, row_slice in slices:
        table = soup.findAll("table")[table_index]
        for row in table.findAll('tr')[row_slice]:
            col = row.findAll('td')
            voce = col[0].string
            accertamento = col[1].string
            competenza = col[2].string
            residui = col[3].string
            record = (voce, accertamento, competenza, residui)
            print >> outfile, "|".join(record)

outfile = open("modena_quadro02.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()
Everything would work fine, except that the first column of some tables on that page contains words with accented characters.
When I run the code, I get the following:
Traceback (most recent call last):
File "modena2.py", line 158, in <module>
extract(soup1)
File "modena2.py", line 98, in extract
print >> outfile, "|".join(record)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 32: ordinal not in range(128)
I understand the problem lies in the encoding of the accented characters. I've tried to find a solution, but this really goes beyond my knowledge.
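From my searching, I gather that BeautifulSoup gives back unicode strings, and that writing them to a plain file falls back to Python 2's default ascii codec, which can't represent characters like à. I think the fix would look something like this sketch (the field values here are invented, just to show the idea), but I'm not sure it's the right approach:

```python
# Sketch with made-up field values: encode the joined unicode record
# to UTF-8 bytes before writing, instead of relying on the default
# ascii codec, which fails on characters such as u'\xe0' (à).
record = (u'Avanzo di amministrazione \xe0', u'123', u'456', u'789')
line = u"|".join(record).encode("utf-8")  # bytes, safe to write to a file
```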
I want to thank in advance anyone willing to help me. I really appreciate it!
Sorry if this question is too basic, but, as I said, I've only just started with Python and I'm teaching myself everything.
Thanks!
Nicola