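Treat this as two separate tasks:
- collect some data items from a "dirty" source (this CSV file)
- store that data somewhere so that it can be easily accessed and manipulated programmatically (depending on what you want to do with it)
Processing the dirty CSV
One way to do this is to have a function deserialize_business() that distills structured business information from each incoming line of the CSV. The function may be complex, because that is the nature of the task, but it is still advisable to split it into self-contained smaller functions (such as get_outlets(), get_headings(), and so on). The function can return a dictionary, but depending on your needs it could be a [named] tuple, a custom object, etc.
This function would be the "adapter" for this particular CSV data source.
Example of a deserialization function: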
def deserialize_business(csv_line):
    """
    Distills structured business information from given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """
    pieces = [piece.strip("[\"'] ") for piece in csv_line.strip().split(',')]
    name = pieces[0]
    phone = pieces[1]
    owner = pieces[2]
    btype = pieces[3]
    yoe = pieces[4]
    # headings begin right after yoe (index 5) and run until
    # the first "Outlets Address" piece
    outlets_start = pieces.index("Outlets Address")
    headings = pieces[5:outlets_start]
    # outlets go from the first "Outlets Address" piece until category
    outlet_pieces = pieces[outlets_start:-1]
    # combine each individual outlet information into a string
    # and let ``deserialize_outlet()`` deal with that; the chunk before
    # the first "Outlets Address" marker is empty, so blank chunks are skipped
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet)
               for outlet in raw_outlets if outlet.strip(", ")]
    # category is the last piece
    category = pieces[-1]
    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
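If a named tuple suits you better than a dictionary, the conversion is small. The following is a minimal sketch, not part of the original function: the names Business and business_from_dict() are hypothetical, and the field types are guesses (in particular, outlets is typed loosely because the shape returned by deserialize_outlet() is not specified here).

```python
from typing import Any, List, NamedTuple


class Business(NamedTuple):
    """Hypothetical immutable record mirroring the dictionary keys."""
    name: str
    phone: str
    owner: str
    btype: str
    yoe: str
    headings: List[str]
    outlets: List[Any]  # whatever deserialize_outlet() returns
    category: str


def business_from_dict(data):
    """Build a Business from a dictionary with exactly the keys above."""
    return Business(**data)


# Sample values taken from the relational example; outlets left empty.
record = business_from_dict({
    'name': 'Meat One',
    'phone': '+92-21-111163281',
    'owner': 'Al Shaheer Corporation',
    'btype': 'Retailers',
    'yoe': '2008',
    'headings': ['Abattoirs', 'Exporters'],
    'outlets': [],
    'category': 'Abattoirs in Pakistan',
})
```

Fields are then available as attributes (record.name, record.headings), and record._asdict() gives the dictionary form back when a storage function expects one.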
Example of calling it:
import logging

log = logging.getLogger(__name__)

with open("phonebookCOMPK-Directory.csv") as f:
    # enumerate() keeps the 1-based line number for error reporting
    for lineno, line in enumerate(f, start=1):
        try:
            business = deserialize_business(line)
        except Exception:
            # Bad line formatting?
            log.exception("Failed to deserialize line #%s!", lineno)
        else:
            # All is well
            store_business(business)
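Storing the data
You will have the store_business() function take your data structure and write it somewhere. Maybe it will be another CSV that is better structured, maybe multiple CSVs, a JSON file, or you can make use of SQLite relational database facilities, since Python has them built in.
It all depends on what you want to do later.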
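For the simplest of those options, a JSON file, store_business() could look like the sketch below. It is only an illustration under stated assumptions: the JSON Lines layout (one business per line), the default file name businesses.jsonl, the path parameter, and the load_businesses() helper are all invented here, and it assumes the outlets produced by deserialize_outlet() are JSON-serializable (strings, lists, or dictionaries).

```python
import json


def store_business(business, path="businesses.jsonl"):
    """Append one business dictionary to a JSON Lines file."""
    # One JSON document per line keeps appends cheap and lets a bad
    # record be skipped later without corrupting the rest of the file.
    with open(path, "a", encoding="utf-8") as out:
        out.write(json.dumps(business, ensure_ascii=False) + "\n")


def load_businesses(path="businesses.jsonl"):
    """Read every stored business back as a list of dictionaries."""
    with open(path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]
```

Because path has a default, the call store_business(business) from the loop above works unchanged.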
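Relational example
In this case your data would be split across multiple tables. (I am using the word "table", but it can be a CSV file, although you can just as well use an SQLite DB, since Python has that built in.)
Table identifying all possible business headings: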
business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers
Table identifying all possible categories:
category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"
Table identifying businesses:
business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3
Table describing their outlets:
business ID, city, address, landmarks, phone
1, Karachi UAN, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi UAN, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281
Table describing their headings:
business ID, business heading ID
1, 1
1, 2
1, 3
…
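Handling all this would require a complex store_business() function. If you go with the relational way of keeping the data, it may be worth looking into SQLite and some ORM framework.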
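To give a feel for the relational variant without an ORM, here is a rough sketch using Python's built-in sqlite3 module. Everything specific in it is an assumption made for illustration: the table and column names, the open_db() and get_or_create() helpers, the extra conn argument, and above all the idea that each outlet is a dictionary with city, address, landmarks, and phone keys. It also stores the category as a single flat name and leaves the parent category unset.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS heading (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE IF NOT EXISTS category (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES category(id),
    name TEXT NOT NULL UNIQUE);
CREATE TABLE IF NOT EXISTS business (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL, phone TEXT, owner TEXT,
    btype TEXT, yoe TEXT, category_id INTEGER REFERENCES category(id));
CREATE TABLE IF NOT EXISTS outlet (
    business_id INTEGER NOT NULL REFERENCES business(id),
    city TEXT, address TEXT, landmarks TEXT, phone TEXT);
CREATE TABLE IF NOT EXISTS business_heading (
    business_id INTEGER NOT NULL REFERENCES business(id),
    heading_id INTEGER NOT NULL REFERENCES heading(id),
    PRIMARY KEY (business_id, heading_id));
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def get_or_create(conn, table, name):
    """Return the id of the row called ``name``, inserting it if missing."""
    # ``table`` is only ever a fixed literal from this module, never user input.
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = conn.execute(
        f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    return row[0]


def store_business(conn, business):
    """Write one deserialized business across the five tables."""
    with conn:  # commits on success, rolls back if anything raises
        category_id = get_or_create(conn, "category", business["category"])
        cur = conn.execute(
            "INSERT INTO business (name, phone, owner, btype, yoe, category_id)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (business["name"], business["phone"], business["owner"],
             business["btype"], business["yoe"], category_id))
        business_id = cur.lastrowid
        for heading in business["headings"]:
            heading_id = get_or_create(conn, "heading", heading)
            conn.execute(
                "INSERT OR IGNORE INTO business_heading VALUES (?, ?)",
                (business_id, heading_id))
        for outlet in business["outlets"]:  # assumed to be dictionaries
            conn.execute(
                "INSERT INTO outlet VALUES (?, ?, ?, ?, ?)",
                (business_id, outlet["city"], outlet["address"],
                 outlet["landmarks"], outlet["phone"]))
    return business_id
```

Note that this version takes the connection as its first argument, so the calling loop would pass it along, for example store_business(conn, business) after conn = open_db("businesses.db"). An ORM would replace most of the hand-written SQL, at the cost of an extra dependency.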