Python and JSON: ValueError: Unterminated string starting at:

2024-04-18

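I have read multiple StackOverflow articles on this and most of the top 10 Google results. Where my issue differs is that I am using one Python script to create my JSON files, and the next script, run less than 10 minutes later, can't read that very file.

Short version: I generate leads for my online business. I am trying to learn Python in order to have better analytics on those leads. I am scouring 2 years' worth of leads with the intent of retaining the useful data and dropping anything personal (email addresses, names, etc.), while also consolidating 30,000+ leads into a few dozen files for easy access.

So my first script opens every individual lead file (30,000+) and determines the date each lead was captured from a timestamp in the file. It then saves the lead under the appropriate key in a dict. When all the data has been aggregated into this dict, text files are written out using json.dumps.

The dict's structure is: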

addData['lead']['July_2013'] = { ... }

where the 'lead' key can be 'lead', 'partial', or a few others, and the 'July_2013' key is a date-based key that can be any combination of the full month name and 2013 or 2014, going back to 'February_2013'.
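As a rough illustration of that layout, the sketch below files one record under its category and month key. The helper names `month_key` and `add_lead` and the `source` field are hypothetical; only the `timeStamp` field and the `%B_%Y` key format come from the code posted further down.

```python
from datetime import datetime


def month_key(ts):
    """Return a key such as 'July_2013' for a Unix timestamp."""
    return datetime.fromtimestamp(ts).strftime("%B_%Y")


def add_lead(agg, kind, lead):
    """Store a lead at agg[kind][month][timestamp], dropping 'timeStamp'."""
    lead = dict(lead)            # copy so the caller's dict is untouched
    ts = lead.pop("timeStamp")
    agg.setdefault(kind, {}).setdefault(month_key(ts), {})[ts] = lead
    return agg


agg = {}
# Hypothetical record; 1373630400 falls in mid-July 2013 in any timezone.
add_lead(agg, "lead", {"timeStamp": 1373630400, "source": "form-a"})
```

Using `setdefault` at both levels avoids the try/except KeyError dance when a category or month is seen for the first time.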

The full error is this:

ValueError: Unterminated string starting at: line 1 column 9997847 (char 9997846)

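But I've looked at the file manually, and my IDE says there are only 76,655 characters in the file. So how did it get to 9997846?

The file that fails is the 8th to be read; the other 7, and all the files that come after it, read in via json.loads just fine.

Python says there is an unterminated string, so I looked at the end of the JSON in the file that fails, and it appears to be fine. I've seen some mention of newlines (\n) in JSON, but this string is all on one line. I've seen mention of \ vs \\, but in a quick look over the whole file I didn't see any \. Other files do have \ and they read in fine. And these files were all created by json.dumps.

I can't post the file because it still has personal info in it. Manually attempting to validate the JSON of a 76,000-character file isn't really viable.

Thoughts on how to debug this would be appreciated. In the meantime I am going to try to rebuild the files and see whether this was a one-off bug, but that takes a while.

Python 2.7 via Spyder & Anaconda
Windows 7 Pro

--- Edit --- As requested, I am posting the writing code here: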

from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time


global leadDir
global archiveDir
global aggLeads


def aggregate_individual_lead_files():
    """

    """

    # Get the aggLead global and 
    global aggLeads

    # Get all the Files with a 'lead' extension & aggregate them
    exts = [
        'lead',
        'partial',
        'inp',
        'err',
        'nobuyer',
        'prospect',
        'sent'
    ]

    for srchExt in exts:
        agg = {}
        leads = f.recursiveGlob(leadDir, '*.cd.' + srchExt)
        print "There are {} {} files to process".format(len(leads), srchExt)

        for lead in leads:
            # Get the Base Filename
            fname = f.basename(lead)
            #uniqID = st.fetchBefore('.', fname)

            #print "File: ", lead

            # Get Lead Data
            leadData = json.loads(f.file_get_contents(lead))

            agg = agg_data(leadData, agg, fname)

        aggLeads[srchExt] = copy.deepcopy(agg)

        print "Aggregate Top Lvl Keys: ", aggLeads.keys()
        print "Aggregate Next Lvl Keys: "

        for key in aggLeads:
            print "{}: ".format(key)

            for arcDate in aggLeads[key].keys():
                print "{}: {}".format(arcDate, len(aggLeads[key][arcDate]))

        # raw_input("Press Enter to continue...")


def agg_data(leadData, agg, fname=None):
    """

    """
    #print "Lead: ", leadData

    # Get the timestamp of the lead
    try:
        ts = leadData['timeStamp']
        leadData.pop('timeStamp')
    except KeyError:
        return agg

    leadDate = datetime.fromtimestamp(ts)
    arcDate = leadDate.strftime("%B_%Y")

    #print "Archive Date: ", arcDate

    try:
        agg[arcDate][ts] = leadData
    except KeyError:
        agg[arcDate] = {}
        agg[arcDate][ts] = leadData
    except TypeError:
        print "Timestamp: ", ts
        print "Lead: ", leadData
        print "Archive Date: ", arcDate
        return agg

    """
    if fname is not None:
        archive_lead(fname, arcDate)
    """

    #print "File: {} added to {}".format(fname, arcDate)

    return agg


def archive_lead(fname, arcDate):
    # Archive Path
    newArcPath = archiveDir + arcDate + '//'

    if not os.path.exists(newArcPath):
        os.makedirs(newArcPath)

    # Move the file to the archive
    os.rename(leadDir + fname, newArcPath + fname)


def reformat_old_agg_data():
    """

    """

    # Get the aggLead global and 
    global aggLeads
    aggComplete = {}
    aggPartial = {}

    oldAggFiles = f.recursiveGlob(leadDir, '*.cd.agg')
    print "There are {} old aggregate files to process".format(len(oldAggFiles))

    for agg in oldAggFiles:
        tmp = json.loads(f.file_get_contents(agg))

        for uniqId in tmp:
            leadData = tmp[uniqId]

            if leadData['isPartial'] == True:
                aggPartial = agg_data(leadData, aggPartial)
            else:
                aggComplete = agg_data(leadData, aggComplete)

    arcData = dict(aggLeads['lead'].items() + aggComplete.items())
    aggLeads['lead'] = arcData

    arcData = dict(aggLeads['partial'].items() + aggPartial.items())
    aggLeads['partial'] = arcData    


def output_agg_files():
    for ext in aggLeads:
        for arcDate in aggLeads[ext]:
            arcFile = leadDir + arcDate + '.cd.' + ext + '.agg'

            if f.file_exists(arcFile):
                tmp = json.loads(f.file_get_contents(arcFile))
            else:
                tmp = {}

            arcData = dict(tmp.items() + aggLeads[ext][arcDate].items())

            f.file_put_contents(arcFile, json.dumps(arcData))


def main():
    global leadDir
    global archiveDir
    global aggLeads

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    archiveDir = leadDir + 'archive//'
    aggLeads = {}


    # Aggregate all the old individual file
    aggregate_individual_lead_files()

    # Reformat the old aggregate files
    reformat_old_agg_data()

    # Write it all out to an aggregate file
    output_agg_files()


if __name__ == "__main__":
    main()

Here's the reading code:

from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time


global leadDir
global fields
global fieldTimes
global versions


def parse_agg_file(aggFile):
    global leadDir
    global fields
    global fieldTimes

    try:
        tmp = json.loads(f.file_get_contents(aggFile))
    except ValueError:
        print "{} failed the JSON load".format(aggFile)
        return False

    print "Opening: ", aggFile

    for ts in tmp:
        try:
            tmpTs = float(ts)
        except:
            print "Timestamp: ", ts
            continue

        leadData = tmp[ts]

        for field in leadData:
            if field not in fields:
                fields[field] = []

            fields[field].append(float(ts))


def determine_form_versions():
    global fieldTimes
    global versions

    # Determine all the fields and their start and stop times
    times = []
    for field in fields:
        minTs = min(fields[field])
        fieldTimes[field] = [minTs, max(fields[field])]
        times.append(minTs)
        print 'Min ts: {}'.format(minTs)

    times = set(sorted(times))
    print "Times: ", times
    print "Fields: ", fieldTimes

    versions = {}
    for ts in times:
        d = datetime.fromtimestamp(ts)
        ver = d.strftime("%d_%B_%Y")

        print "Version: ", ver

        versions[ver] = []
        for field in fields:
            if ts in fields[field]:
                versions[ver].append(field)


def main():
    global leadDir
    global fields
    global fieldTimes

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    fields = {}
    fieldTimes = {}

    aggFiles = f.glob(leadDir + '*.lead.agg')

    for aggFile in aggFiles:
        parse_agg_file(aggFile)

    determine_form_versions()

    print "Versions: ", versions




if __name__ == "__main__":
    main()

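So I figured it out... I am posting this answer just in case someone else makes the same mistake.

First, I found a workaround, but I wasn't sure why it worked. From my original code, here is my file_get_contents function: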

def file_get_contents(fname):
    if s.stripos(fname, 'http://'):
        import urllib2
        return urllib2.urlopen(fname).read(maxUrlRead)
    else:
        return open(fname).read(maxFileRead)

I used it via:

tmp = json.loads(f.file_get_contents(aggFile))

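This failed, over and over again. However, while I was trying to get Python to at least give me the JSON string so I could put it through a JSON validator (http://jsonlint.com/), I came across a mention of json.load vs json.loads. So I tried this instead: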

a = open('D://Server Data//eagle805//emmetrics//forms//leads\July_2014.cd.lead.agg')
b = json.load(a)

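While I haven't tested this output in my overall code, this code chunk does in fact read in the file, decode the JSON, and even display the data without crashing Spyder. The variable explorer in Spyder shows that b is a dict of size 1465, which is exactly how many records it should have. The portion of the displayed text from the end of the dict all looks good. So overall I have reasonably high confidence that the data was parsed correctly.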
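For reference, the same json.load approach can be written with a context manager so the file handle is always closed. This is only a sketch: `load_json_file` is a hypothetical helper name, and the sample record and temporary file stand in for the real .agg file.

```python
import json
import os
import tempfile


def load_json_file(path):
    """Parse a whole JSON file; json.load reads the file object to the end."""
    with open(path) as handle:
        return json.load(handle)


# Round trip through a temporary file standing in for an .agg file.
sample = {"1404172800.0": {"zip": "00000"}}   # hypothetical record
fd, tmp_path = tempfile.mkstemp(suffix=".agg")
with os.fdopen(fd, "w") as handle:
    json.dump(sample, handle)

loaded = load_json_file(tmp_path)
os.remove(tmp_path)
```

Because json.load calls read() on the file object with no size argument, no read cap is involved on this path.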

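When I wrote file_get_contents, I had seen several recommendations to always supply a maximum number of bytes to read, so as to prevent Python from hanging on a bad return. The value of maxReadFile (the cap passed to read() in the function above, where it is spelled maxFileRead) was 1E7. When I manually forced maxReadFile to be 1E9, everything worked fine. It turns out the file is just under 1.2E7 bytes. So the string returned from reading the file was not the full contents of the file, and as a result it was invalid JSON.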
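The failure mode is easy to reproduce on a small scale. The sketch below is not the original code: `MAX_FILE_READ` is a hypothetical, deliberately tiny cap standing in for the 1E7 value, and the record is made up. It shows that read(n) stops silently at the cap and that json.loads then reports an unterminated string.

```python
import json
import os
import tempfile

MAX_FILE_READ = 64   # hypothetical cap, far smaller than the file

# Write a JSON document that is longer than the cap.
record = {"note": "x" * 200}
fd, tmp_path = tempfile.mkstemp(suffix=".agg")
with os.fdopen(fd, "w") as handle:
    json.dump(record, handle)

# read(n) returns at most n characters and gives no sign it stopped early.
with open(tmp_path) as handle:
    text = handle.read(MAX_FILE_READ)

file_size = os.path.getsize(tmp_path)
os.remove(tmp_path)

try:
    json.loads(text)
    error_message = None
except ValueError as exc:        # json's decode error is a ValueError
    error_message = str(exc)

print("read {} of {} bytes".format(len(text), file_size))
print(error_message)
```

Comparing `len(text)` with `os.path.getsize(path)` is a quick way to spot this kind of truncation before blaming the JSON itself.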

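Normally I would think this was a bug, but clearly when opening and reading a file you need to be able to read just a chunk at a time for memory management. So I got bitten by my own shortsightedness regarding the maxReadFile value. The error message was correct, but it sent me off on a wild goose chase.

Hopefully this saves someone else some time.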
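If a size cap is still wanted, one option is to fail loudly when a file exceeds it rather than returning a truncated string. The following is a minimal sketch under that assumption; `read_capped` is a hypothetical replacement for the local-file branch of file_get_contents, not the original helper.

```python
import json
import os
import tempfile


def read_capped(path, max_bytes):
    """Return the whole file, or raise if it is larger than max_bytes."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(
            "{} is {} bytes, over the {} byte cap".format(path, size, max_bytes)
        )
    with open(path) as handle:
        return handle.read()


# Demonstration with a temporary file of 212 bytes.
fd, tmp_path = tempfile.mkstemp(suffix=".agg")
with os.fdopen(fd, "w") as handle:
    json.dump({"note": "x" * 200}, handle)

try:
    read_capped(tmp_path, 64)
    too_small_error = None
except ValueError as exc:
    too_small_error = str(exc)

parsed = json.loads(read_capped(tmp_path, 1024))
os.remove(tmp_path)
```

With this shape, an undersized cap shows up as an explicit size error at read time instead of a confusing JSON error later.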
