Context
In this post:
ConvertFrom-Json large file https://stackoverflow.com/q/76784490/268581
I asked about deserializing a 1.2GB JSON file.
This answer was posted there:
https://stackoverflow.com/a/76791900/268581
It does work, but it is very slow.
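Sample data
So that you do not have to work with the 1.2GB file, here is a small data sample for working on this problem. It is just the first few items of the original large JSON file.
example.json: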
[{"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AMD230728C00115000", "exchange": 304, "id": null, "tape": null, "price": 0.38, "size": 1, "conditions": [227], "timestamp": 1690471217275, "sequence_number": 1477738810, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AFRM230728C00019500", "exchange": 302, "id": null, "tape": null, "price": 0.07, "size": 10, "conditions": [209], "timestamp": 1690471217278, "sequence_number": 1477739110, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 325, "id": null, "tape": null, "price": 4.8, "size": 7, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519150, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519166, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519167, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 319, "id": null, "tape": null, "price": 4.8, "size": 5, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519170, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 
19, "conditions": [209], "timestamp": 1690471217284, "sequence_number": 341519682, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 2, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519926, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 15, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519927, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:META230728C00315000", "exchange": 302, "id": null, "tape": null, "price": 4.76, "size": 1, "conditions": [227], "timestamp": 1690471217323, "sequence_number": 1290750877, "trf_id": null, "trf_timestamp": null}]
Code
Here is the code that works (slowly). It takes several hours to run on the 1.2GB file.
$path = ".\example.json"
$stream = [System.IO.File]::Open($path, [System.IO.FileMode]::Open)
$i = 0
$stream.ReadByte() | Out-Null # read '['
$i++
$json = ''
$data = @()
while ($i -lt $stream.Length)
{
    $byte = $stream.ReadByte(); $i++
    $char = [Convert]::ToChar($byte)
    if ($char -eq '}')
    {
        $json = $json + [Convert]::ToChar($byte)
        $data = $data + ($json | ConvertFrom-Json)
        $json = ''
        $stream.ReadByte() | Out-Null # read comma;
        $i++
        if ($data.Count % 100 -eq 0)
        {
            Write-Host $data.Count
        }
    }
    else
    {
        $json = $json + [Convert]::ToChar($byte)
    }
}
$stream.Close()
After running it, you should have the records in $data:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data | ft *
py/object                                   event_type symbol                exchange id tape price size conditions     timestamp sequence_number trf_id trf_timestamp
---------                                   ---------- ------                -------- -- ---- ----- ---- ----------     --------- --------------- ------ -------------
polygon.websocket.models.models.EquityTrade T          O:AMD230728C00115000       304           0.38    1 {227}      1690471217275      1477738810
polygon.websocket.models.models.EquityTrade T          O:AFRM230728C00019500      302           0.07   10 {209}      1690471217278      1477739110
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      325            4.8    7 {219}      1690471217282       341519150
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312            4.8    1 {209}      1690471217282       341519166
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312            4.8    1 {209}      1690471217282       341519167
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      319            4.8    5 {219}      1690471217282       341519170
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312            4.8   19 {209}      1690471217284       341519682
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301            4.8    2 {219}      1690471217290       341519926
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301            4.8   15 {219}      1690471217290       341519927
polygon.websocket.models.models.EquityTrade T          O:META230728C00315000      302           4.76    1 {227}      1690471217323      1290750877
Question
What is a good way to make this more efficient?
Notes
This answer:
https://stackoverflow.com/a/43747641/268581
does show a C# approach that uses Newtonsoft Json.NET.
Here is its code:
JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // deserialize only when there's "{" character in the stream
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}
One approach would be to download the Newtonsoft Json.NET DLL and translate the code above to PowerShell. One challenge is this line:
o = serializer.Deserialize<MyObject>(reader);
As you can see, it makes a generic method call. It is not clear to me how to translate that to Windows PowerShell 5.1.
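For reference, a minimal sketch of one way a generic method can be invoked from Windows PowerShell 5.1: close the generic method over a concrete type through reflection. `System.Linq.Enumerable.OfType<T>` is only a stand-in here so that the snippet runs without the Json.NET DLL, and the variable names are illustrative. The same `GetMethods()` / `MakeGenericMethod()` / `Invoke()` pattern would presumably carry over to `serializer.Deserialize<MyObject>(reader)`, with the serializer instance passed as the first argument of `Invoke` instead of `$null`; that part is an assumption and is not tested here.

```powershell
# Stand-in input: a non-generic list holding mixed types.
$items = New-Object System.Collections.ArrayList
[void]$items.Add(1)
[void]$items.Add('two')
[void]$items.Add(3)

# Find the open generic method definition Enumerable.OfType<TResult>(IEnumerable).
# Filtering GetMethods() (rather than calling GetMethod by name) also copes with
# types that have several overloads sharing one name.
$openMethod = [System.Linq.Enumerable].GetMethods() |
    Where-Object { $_.Name -eq 'OfType' -and $_.IsGenericMethodDefinition } |
    Select-Object -First 1

# Close the generic method over a concrete type, here OfType<int>.
$closedMethod = $openMethod.MakeGenericMethod([int])

# OfType is static, so the target object is $null. The unary comma keeps
# PowerShell from unrolling $items into separate arguments.
$ints = @($closedMethod.Invoke($null, @(, $items)))

$ints
```

Json.NET's `JsonSerializer` also exposes a non-generic `Deserialize(JsonReader, Type)` overload, which may make the reflection step unnecessary.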
A solution that relies only on native JSON deserialization libraries would be preferred, but the Newtonsoft approach is acceptable if necessary.
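As a rough sketch of a direction that stays within the built-in cmdlets: the loop above copies the whole `$data` array on every `+` and builds each record one character at a time, so much of the cost is probably in the bookkeeping rather than in the JSON parsing. The hypothetical function below (`Read-FlatJsonArray` and its `ChunkChars` parameter are names made up for illustration) reads the file in blocks, cuts each block at its last `}`, and hands all complete records in the block to a single `ConvertFrom-Json` call. It makes the same assumption as the loop above: every record is a flat object, so `}` never appears inside a string or a nested object. It has not been timed against the 1.2GB file.

```powershell
function Read-FlatJsonArray {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $ChunkChars = 256KB   # characters read per block (assumed value)
    )

    # .NET resolves relative paths against the process directory, not the
    # PowerShell location, so resolve the path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $records = [System.Collections.Generic.List[object]]::new()
    $reader  = [System.IO.StreamReader]::new($fullPath)
    $buffer  = New-Object char[] $ChunkChars
    $carry   = ''                        # incomplete record left over from the previous block
    $skip    = [char[]]"[, `r`n`t"       # separators that may precede the first record in a block

    try {
        while (($n = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
            $text = $carry + [string]::new($buffer, 0, $n)

            # Assumption: '}' only ever closes a record.
            $last = $text.LastIndexOf([char]'}')
            if ($last -lt 0) { $carry = $text; continue }

            $carry = $text.Substring($last + 1)
            $body  = $text.Substring(0, $last + 1).TrimStart($skip)

            # One ConvertFrom-Json call per block instead of one per record.
            $batch = ('[' + $body + ']') | ConvertFrom-Json
            foreach ($record in $batch) { $records.Add($record) }
        }
    }
    finally {
        $reader.Dispose()
    }

    # The unary comma returns the list itself instead of unrolling it.
    , $records
}
```

With the sample file, usage would look like `$data = Read-FlatJsonArray -Path .\example.json`, after which `$data | ft *` can be used as before.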