I'm using Ruby's built-in CSV library in a Rails application. I call a URL (via HTTParty), parse the response, and try to save the results to my database.

The problem is that I get the error "Unquoted fields do not allow \r or \n". This usually indicates a problem with the input data, but when I inspect the data I can't find anything wrong with it.

Here is how I retrieve the data:
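For context, that error usually points at a stray carriage return in the data. A minimal sketch (with made-up sample rows, not the real weather.com data) that reproduces the error, and one way around it by normalizing line endings before parsing:

```ruby
require "csv"

# Made-up sample: the second physical line ends with a lone \r instead of
# \r\n, so CSV sees a \r inside an unquoted field and rejects it.
body = "a.com, 1, DIRECT\r\nb.com, 2, DIRECT\rc.com, 3, DIRECT\r\n"

begin
  CSV.parse(body)
  raised = false
rescue CSV::MalformedCSVError => e
  raised = true
  puts e.message  # the "Unquoted fields do not allow ..." error
end

# Normalizing every \r\n and lone \r to "\n" before parsing sidesteps the
# error when the file merely has inconsistent line endings.
rows = CSV.parse(body.gsub(/\r\n?/, "\n"), skip_blanks: true)
```

After normalization the same string parses into three clean rows.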
response = HTTParty.get("http://" + "weather.com/ads.txt", limit: 100, follow_redirects: true, timeout: 10)
(This data is publicly available at weather.com/ads.txt.)
Then I attempt to parse the data, applying some regular expressions to ignore everything after a "#", skip blank lines, and so on:
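As a side note, the skip_lines filter can be exercised on its own; a quick sketch with made-up lines showing which ones it drops:

```ruby
require "csv"

# Hypothetical sample exercising the skip_lines filter from the question:
# comment lines, blank lines, and contact= lines are all dropped.
data = <<~TXT
  # a comment
  contact=ads@example.com
  google.com, pub-123, DIRECT

  indexexchange.com, 184315, DIRECT
TXT

skip = /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/
rows = CSV.parse(data, skip_blanks: true, skip_lines: skip)
# Only the two record lines survive.
```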
if response.code == 200 && !response.body.match(/<.*html>/)
  active_policies = []
  CSV.parse(response.body, skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|
    begin
      # print out the individual ads.txt records
      puts ""
      print row[0].downcase.strip + " " + row[1].strip + " " + row[2].split("#").first.strip
      active_policies.push(
        publisher.policies.find_or_create_by(
          ad_partner: row[0].downcase.strip,
          external_seller_id: row[1].strip,
          seller_relationship: row[2].split("#").first.strip
        ) do |policy|
          policy.deactivated_at = nil
        end
      )
    rescue => save
      # Add error event to the new sync status model
      puts "we are in the loop"
      puts save.message, row.inspect, save.backtrace
      next
    end
  end
  #else
  #  puts "Too many policies. Skipping " + publisher.name
  #end
  # now we are going to run a check to see if we have any policies that
  # are outdated, and if so, flag them as such.
  deactivated_policies = publisher.policies.where.not(id: active_policies.map(&:id)).where(deactivated_at: nil)
  deactivated_policies.update_all(deactivated_at: Time.now)
  deactivated_policies.each do |deactivated_policy|
    puts "Deactivating Policy for " + deactivated_policy.publisher.name
  end
elsif response.code == 404
  print response.code.to_s + " GET, " + response.body.size.to_s + " body, "
  puts response.headers.size.to_s + " headers for " + publisher.name
elsif response.code == 302
  print response.code.to_s + " GET, " + publisher.name
else
  puts response.code.to_s + " GET ads txt not found on " + publisher.name
end
publisher.update(last_scan: Time.now)
rescue => ex
  puts ex.message, ex.backtrace, "error pulling #{publisher.name} ..."
  # publisher.update_columns(active: "false")
end
end
Some of my thoughts/findings:
I tried stepping through the file and determined that line 134 is what breaks the scan. I did this by checking slices manually, like so: CSV.parse(response.body.lines[140..400].join("\n"), skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/). But this hasn't really helped, because even though I've identified line 134 as the offending line, I don't know how to detect or handle it.
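One way to detect the offending line(s) programmatically, rather than by bisecting slices: scan for carriage returns that are not part of a \r\n pair. A sketch, assuming (as the error message suggests) that the culprit really is a stray \r; the sample string is made up:

```ruby
# Flag physical lines containing a \r that is NOT immediately followed by
# \n. Plain CRLF line endings are fine; a lone \r is what CSV rejects.
body = "a.com, 1, DIRECT\r\nb.com, 2,\rDIRECT\r\n"  # made-up sample

bad_lines = []
body.each_line("\n").with_index(1) do |line, i|
  bad_lines << i if line.match?(/\r(?!\n)/)
end
puts "suspect lines: #{bad_lines.inspect}"
```

Running this against response.body should print the line numbers worth inspecting with line.inspect, which makes invisible characters visible.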
I noticed the source file (at weather.com/ads.txt) has unusual characters, but even after forcing it to UTF-8 with response.body.force_encoding("UTF-8"), it still throws the error.
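Note that force_encoding only relabels the bytes; it doesn't repair invalid sequences. If the encoding were genuinely the issue, String#scrub (or String#encode with replacement options) would actually clean it. A sketch with a deliberately invalid byte:

```ruby
# "\xE9" on its own is not valid UTF-8, so this string is mislabeled:
# force_encoding changes the tag, not the bytes.
raw = "caf\xE9.com, 1, DIRECT\n".dup.force_encoding("UTF-8")
puts raw.valid_encoding?  # still invalid after force_encoding

# scrub replaces invalid byte sequences with the given replacement.
clean = raw.scrub("?")
puts clean
```

If raw.valid_encoding? is already true for the real response body, the problem is not the encoding, which would point back at line endings instead.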
I tried adding next to the rescue block so that even when it hits the error it would move on to the next row in the CSV, but that doesn't happen: it just errors out and stops parsing, so I get the first ~130 entries but not the rest.
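This is because CSV.parse consumes the whole string in a single call, so the rescue inside the block guards the database work, not the parsing, and next cannot resume a parser that has already raised. An alternative sketch: feed each physical line to CSV.parse_line so a rescue can skip just the bad line (sample data is made up; row_sep is pinned so a stray \r isn't auto-detected as the row separator):

```ruby
require "csv"

# Made-up sample: the middle line contains a lone \r.
body = "good.com, 1, DIRECT\nbad.com, 2,\rDIRECT\ngood.com, 3, RESELLER\n"

rows = []
body.each_line("\n") do |line|
  line = line.chomp
  next if line.strip.empty? || line.strip.start_with?("#")
  begin
    # Pin row_sep so the lone \r is treated as a malformed character
    # rather than being auto-detected as the row separator.
    row = CSV.parse_line(line, row_sep: "\n")
    rows << row if row
  rescue CSV::MalformedCSVError => e
    puts "skipping #{line.inspect}: #{e.message}"
  end
end
```

The trade-off is losing multi-call parser state (irrelevant for ads.txt, which has no quoted multi-line fields), in exchange for genuinely per-line error handling.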
Along the same lines, I'm not sure whether the page being served as HTML rather than a plain text file could be causing a problem.
I'd love to know how to detect and handle this error, so any ideas here are very welcome!
For reference, #PBS below is apparently line 134 of the source file, the one giving me trouble, but I'm not sure I fully believe that's the problem:
#canada
google.com, pub-0942427266003794, DIRECT, f08c47fec0942fa0
indexexchange.com, 184315, DIRECT
indexexchange.com, 184601, DIRECT
indexexchange.com, 182960, DIRECT
openx.com, 539462051, DIRECT, 6a698e2ec38604c6
#spain
#PBS
google.com, pub-8750086020675820, DIRECT, f08c47fec0942fa0
google.com, pub-1072712229542583, DIRECT, f08c47fec0942fa0
appnexus.com, 3872, DIRECT
rubiconproject.com, 9778, DIRECT, 0bfd66d529a55807
openx.com, 539967419, DIRECT, 6a698e2ec38604c6
openx.com, 539726051, DIRECT, 6a698e2ec38604c6
google.com, pub-7442858011436823, DIRECT, f08c47fec0942fa0