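I've noticed a problem: during evaluate(), I don't see the results I would expect based on the results of fit(). I've found plenty of discussion online from people with similar problems. For example, this open issue (https://github.com/keras-team/keras/issues/6977) discusses dropout layers and batch normalization as possible causes, but some people there also note that there may be problems distinct from dropout and batch normalization. For a beginner, it's hard to even know what the problem actually is.
The network architecture I'm using does contain batch normalization, but I'm not sure whether that is the problem.
The data for this demonstration can be downloaded here: https://drive.google.com/file/d/1wQZbCuw8cI9cyZIKz956wNLfgfz-o3c3/view?usp=sharing
This script clearly illustrates the problem I'm having: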
import random
import os
import matplotlib.image as mpimg
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
HEIGHT_WIDTH = 299
BATCH_SIZE = 10
VERBOSE = 2
SANITY_SWITCH = False
print('starting script')
net = tf.keras.applications.InceptionResNetV2(
    include_top=True,
    weights=None,  # 'imagenet',
    input_tensor=None,
    input_shape=None,
    pooling=None,
    classes=2,  # 1000,
    classifier_activation='softmax'
)
print_output = True
def utility_metric(y_true, y_pred):
    global print_output
    if print_output:
        print(f'y_true:{y_true.numpy()}')
        print(f'y_pred:{y_pred.numpy()}')
        print_output = False
    return 0
net.compile(
    optimizer='ADAM',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy', utility_metric]
)
net.run_eagerly = True
class_map = {'dog': 0, 'cat': 1}
def preprocess(file):
    imdata = mpimg.imread(file)
    imdata = cv2.resize(imdata, dsize=(HEIGHT_WIDTH, HEIGHT_WIDTH), interpolation=cv2.INTER_LINEAR)
    imdata.shape = (HEIGHT_WIDTH, HEIGHT_WIDTH, 3)
    # Intended to map pixel values from [0, 255] to [-1, 1]
    imdata /= 127.5
    imdata -= 1.
    # The label comes from the name of the parent directory ('dog' or 'cat')
    return imdata, class_map[os.path.basename(os.path.dirname(file))]
train_data = [f'data/Training/cat/{x}' for x in os.listdir('data/Training/cat')] + [f'data/Training/dog/{x}' for x in os.listdir('data/Training/dog')]
test_data = [f'data/Testing/cat/{x}' for x in os.listdir('data/Testing/cat')] + [f'data/Testing/dog/{x}' for x in os.listdir('data/Testing/dog')]
random.shuffle(train_data)
random.shuffle(test_data)
if SANITY_SWITCH:
    tmp_data = train_data
    train_data = test_data
    test_data = tmp_data
def get_gen(data):
    def gen():
        pairs = []
        i = 0
        for im_file in data:
            i += 1
            if i <= BATCH_SIZE:
                pairs += [preprocess(im_file)]
            if i == BATCH_SIZE:
                # Emit one full batch as (images, labels); a trailing partial batch is dropped
                yield (
                    [pair[0] for pair in pairs],
                    [pair[1] for pair in pairs]
                )
                pairs.clear()
                i = 0
    return gen
def get_ds(data):
    return tf.data.Dataset.from_generator(
        get_gen(data),
        (tf.float32, tf.int64),
        output_shapes=(
            tf.TensorShape((BATCH_SIZE, HEIGHT_WIDTH, HEIGHT_WIDTH, 3)),
            tf.TensorShape(([BATCH_SIZE]))
        )
    )
print('starting training')
net.fit(
    get_ds(train_data),
    epochs=5,
    verbose=VERBOSE,
    use_multiprocessing=True,
    workers=16,
    batch_size=BATCH_SIZE,
    shuffle=False
)
print('starting testing')
print_output = True
net.evaluate(
    get_ds(test_data),
    verbose=VERBOSE,
    batch_size=BATCH_SIZE,
    use_multiprocessing=True,
    workers=16,
)
print('script complete')
The full output is here:
starting script
2020-12-22 15:29:33.896474: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-12-22 15:29:34.184215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.186083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:05:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.188086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:08:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.190088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:09:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.192124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 4 with properties:
pciBusID: 0000:84:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.194144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 5 with properties:
pciBusID: 0000:85:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.196095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 6 with properties:
pciBusID: 0000:88:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.197451: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 7 with properties:
pciBusID: 0000:89:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:34.208178: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-12-22 15:29:34.301110: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-12-22 15:29:34.348641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-12-22 15:29:34.370185: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-12-22 15:29:34.459524: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-12-22 15:29:34.471473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-12-22 15:29:34.599447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 15:29:34.634806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2020-12-22 15:29:34.635371: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-12-22 15:29:34.680254: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2000105000 Hz
2020-12-22 15:29:34.687348: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561e331d4820 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-22 15:29:34.687415: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-12-22 15:29:35.617673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.619368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:05:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.621161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:08:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.622953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:09:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.624745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 4 with properties:
pciBusID: 0000:84:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.626508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 5 with properties:
pciBusID: 0000:85:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.628264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 6 with properties:
pciBusID: 0000:88:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.629460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 7 with properties:
pciBusID: 0000:89:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-12-22 15:29:35.629581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-12-22 15:29:35.629633: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-12-22 15:29:35.629685: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-12-22 15:29:35.629733: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-12-22 15:29:35.629788: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-12-22 15:29:35.629837: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-12-22 15:29:35.629886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 15:29:35.657298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2020-12-22 15:29:35.659638: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-12-22 15:29:35.678371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-22 15:29:35.678447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 1 2 3 4 5 6 7
2020-12-22 15:29:35.678500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N Y Y Y N N N N
2020-12-22 15:29:35.678538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1: Y N Y Y N N N N
2020-12-22 15:29:35.678569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2: Y Y N Y N N N N
2020-12-22 15:29:35.678597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3: Y Y Y N N N N N
2020-12-22 15:29:35.678624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 4: N N N N N Y Y Y
2020-12-22 15:29:35.678652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 5: N N N N Y N Y Y
2020-12-22 15:29:35.678678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 6: N N N N Y Y N Y
2020-12-22 15:29:35.678705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 7: N N N N Y Y Y N
2020-12-22 15:29:35.703703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10689 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0, compute capability: 3.7)
2020-12-22 15:29:35.711407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8534 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
2020-12-22 15:29:35.716593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10689 MB memory) -> physical GPU (device: 2, name: Tesla K80, pci bus id: 0000:08:00.0, compute capability: 3.7)
2020-12-22 15:29:35.721879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10689 MB memory) -> physical GPU (device: 3, name: Tesla K80, pci bus id: 0000:09:00.0, compute capability: 3.7)
2020-12-22 15:29:35.726952: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10689 MB memory) -> physical GPU (device: 4, name: Tesla K80, pci bus id: 0000:84:00.0, compute capability: 3.7)
2020-12-22 15:29:35.732126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10689 MB memory) -> physical GPU (device: 5, name: Tesla K80, pci bus id: 0000:85:00.0, compute capability: 3.7)
2020-12-22 15:29:35.736838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 10689 MB memory) -> physical GPU (device: 6, name: Tesla K80, pci bus id: 0000:88:00.0, compute capability: 3.7)
2020-12-22 15:29:35.740357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 108 MB memory) -> physical GPU (device: 7, name: Tesla K80, pci bus id: 0000:89:00.0, compute capability: 3.7)
2020-12-22 15:29:35.746472: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561e387dea00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-12-22 15:29:35.746517: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-12-22 15:29:35.746537: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2020-12-22 15:29:35.746577: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla K80, Compute Capability 3.7
2020-12-22 15:29:35.746594: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla K80, Compute Capability 3.7
2020-12-22 15:29:35.746614: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (4): Tesla K80, Compute Capability 3.7
2020-12-22 15:29:35.746645: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (5): Tesla K80, Compute Capability 3.7
2020-12-22 15:29:35.746664: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (6): Tesla K80, Compute Capability 3.7
2020-12-22 15:29:35.746694: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (7): Tesla K80, Compute Capability 3.7
starting training
Epoch 1/5
2020-12-22 15:29:48.307104: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 15:29:51.694232: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256
2020-12-22 15:29:51.796020: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-12-22 15:29:52.577156: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
y_true:[[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]]
y_pred:[[0.58956003 0.41043994]
[0.63762885 0.36237112]
[0.53731585 0.46268415]
[0.5393683 0.4606317 ]
[0.90735996 0.09264001]
[0.552977 0.44702297]
[0.7115651 0.28843486]
[0.4068687 0.59313136]
[0.5482196 0.4517804 ]
[0.4330527 0.56694734]]
72/72 - 81s - loss: 0.9134 - accuracy: 0.5417 - utility_metric: 0.0000e+00
Epoch 2/5
72/72 - 81s - loss: 0.7027 - accuracy: 0.5847 - utility_metric: 0.0000e+00
Epoch 3/5
72/72 - 83s - loss: 0.6851 - accuracy: 0.5819 - utility_metric: 0.0000e+00
Epoch 4/5
72/72 - 83s - loss: 0.6810 - accuracy: 0.5944 - utility_metric: 0.0000e+00
Epoch 5/5
72/72 - 83s - loss: 0.6895 - accuracy: 0.5625 - utility_metric: 0.0000e+00
starting testing
y_true:[[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]]
y_pred:[[0.39538118 0.6046188 ]
[0.39505056 0.6049495 ]
[0.39406297 0.605937 ]
[0.3947329 0.60526717]
[0.3935887 0.60641134]
[0.39452523 0.60547477]
[0.39451653 0.6054835 ]
[0.39475334 0.60524666]
[0.39559898 0.604401 ]
[0.3951175 0.60488254]]
90/90 - 37s - loss: 0.7157 - accuracy: 0.5000 - utility_metric: 0.0000e+00
script complete
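The part of the output to focus on is the accuracy:
Training epoch 1: 0.5417
Training epoch 2: 0.5847
Training epoch 3: 0.5819
Training epoch 4: 0.5944
Training epoch 5: 0.5625
Evaluation: 0.5000
I've also included the raw output of the network in both cases.
One during training: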
y_true:[[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]]
y_pred:[[0.58956003 0.41043994]
[0.63762885 0.36237112]
[0.53731585 0.46268415]
[0.5393683 0.4606317 ]
[0.90735996 0.09264001]
[0.552977 0.44702297]
[0.7115651 0.28843486]
[0.4068687 0.59313136]
[0.5482196 0.4517804 ]
[0.4330527 0.56694734]]
And one during testing:
y_true:[[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]]
y_pred:[[0.39538118 0.6046188 ]
[0.39505056 0.6049495 ]
[0.39406297 0.605937 ]
[0.3947329 0.60526717]
[0.3935887 0.60641134]
[0.39452523 0.60547477]
[0.39451653 0.6054835 ]
[0.39475334 0.60524666]
[0.39559898 0.604401 ]
[0.3951175 0.60488254]]
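What I find confusing is why, during testing, the output seems to vary so little from image to image. This seems related to the root of the problem, but I don't know what causes it.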
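To put numbers on that observation, here is a small sketch that takes the two batches printed above and computes the predicted labels, the per-batch accuracy, and the spread of the class-1 probability. The arrays are copied by hand from the log; the helper name `summarize` is only for illustration and is not part of the script.

```python
import numpy as np

# Values copied from the printed output (first batch seen by fit() and by evaluate()).
train_true = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1])
train_pred = np.array([
    [0.58956003, 0.41043994], [0.63762885, 0.36237112],
    [0.53731585, 0.46268415], [0.5393683, 0.4606317],
    [0.90735996, 0.09264001], [0.552977, 0.44702297],
    [0.7115651, 0.28843486], [0.4068687, 0.59313136],
    [0.5482196, 0.4517804], [0.4330527, 0.56694734],
])
test_true = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 1])
test_pred = np.array([
    [0.39538118, 0.6046188], [0.39505056, 0.6049495],
    [0.39406297, 0.605937], [0.3947329, 0.60526717],
    [0.3935887, 0.60641134], [0.39452523, 0.60547477],
    [0.39451653, 0.6054835], [0.39475334, 0.60524666],
    [0.39559898, 0.604401], [0.3951175, 0.60488254],
])


def summarize(y_true, y_pred):
    """Return predicted labels, batch accuracy, and std of the class-1 probability."""
    labels = y_pred.argmax(axis=1)
    accuracy = float((labels == y_true).mean())
    spread = float(y_pred[:, 1].std())
    return labels, accuracy, spread


train_labels, train_acc, train_spread = summarize(train_true, train_pred)
test_labels, test_acc, test_spread = summarize(test_true, test_pred)

print("train:", train_labels, train_acc, train_spread)
print("test: ", test_labels, test_acc, test_spread)
```

For the test batch every row is assigned label 1, so the accuracy on that batch is simply the fraction of 1s in y_true, and the spread of the outputs is far smaller than in the training batch.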
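I've run this script many times, and some things are consistent. Accuracy during evaluation is always exactly chance. During evaluation, y_pred always has low variance, and all outputs appear to be the same label (so, for example, during evaluation the model might report every input image as "dog").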
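One thing I have not ruled out is the range of the inputs. preprocess() divides by 127.5 and subtracts 1, which only maps to [-1, 1] if the pixels start in [0, 255]. As far as I know, matplotlib's imread returns floats in [0, 1] for PNG files and integer arrays for most other formats, and I have not checked which case applies to these files. The sketch below uses two synthetic arrays (not the real images) to show what the same arithmetic does in each case; `scale_like_preprocess` is a hypothetical helper that only mirrors the two scaling lines of the script.

```python
import numpy as np


def scale_like_preprocess(imdata):
    """Apply the same arithmetic as preprocess(): x / 127.5 - 1."""
    imdata = imdata.astype(np.float32)
    imdata /= 127.5
    imdata -= 1.0
    return imdata


# Synthetic stand-ins for an image, not data from the real dataset.
img_255 = np.linspace(0.0, 255.0, 48, dtype=np.float32).reshape(4, 4, 3)  # pixels in [0, 255]
img_01 = np.linspace(0.0, 1.0, 48, dtype=np.float32).reshape(4, 4, 3)     # pixels in [0, 1]

out_255 = scale_like_preprocess(img_255)
out_01 = scale_like_preprocess(img_01)

print("input in [0, 255] ->", out_255.min(), out_255.max())
print("input in [0, 1]   ->", out_01.min(), out_01.max())
```

If the inputs were already in [0, 1], every image would land in a very narrow band just above -1, so printing the min and max of one real preprocessed image would be a cheap check. Whether that has anything to do with the evaluation results is something I have not confirmed.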
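Sometimes accuracy gets above 60% during training. That doesn't affect the problem. I could keep increasing the dataset size and the number of epochs and try to improve the training results, but I'm wary of moving forward without first understanding why the evaluation results are as strange as they are.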
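Since batch normalization is the usual suspect in the linked issue, here is a simplified NumPy sketch of why such a layer can behave differently in fit() and evaluate(). It is not the Keras implementation: it models a single feature with no learned scale or offset, and it assumes the Keras defaults for BatchNormalization (momentum 0.99, epsilon 1e-3, moving mean starting at 0 and moving variance at 1). The input distribution is made up. In training mode the layer normalizes with the statistics of the current batch; in inference mode it uses the moving averages, which only slowly approach the true statistics.

```python
import numpy as np

MOMENTUM = 0.99  # Keras BatchNormalization default (assumed unchanged in the model)
EPSILON = 1e-3   # Keras BatchNormalization default


def bn_training(x, moving_mean, moving_var):
    """Normalize with batch statistics and update the moving averages."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    out = (x - batch_mean) / np.sqrt(batch_var + EPSILON)
    new_mean = MOMENTUM * moving_mean + (1 - MOMENTUM) * batch_mean
    new_var = MOMENTUM * moving_var + (1 - MOMENTUM) * batch_var
    return out, new_mean, new_var


def bn_inference(x, moving_mean, moving_var):
    """Normalize with the stored moving averages only."""
    return (x - moving_mean) / np.sqrt(moving_var + EPSILON)


def train_inference_gap(num_steps, rng):
    """Mean absolute difference between the two modes after num_steps updates."""
    moving_mean, moving_var = np.zeros(1), np.ones(1)
    for _ in range(num_steps):
        batch = rng.normal(5.0, 2.0, size=(10, 1))  # made-up activations, batch of 10
        _, moving_mean, moving_var = bn_training(batch, moving_mean, moving_var)
    probe = rng.normal(5.0, 2.0, size=(1000, 1))
    train_out, _, _ = bn_training(probe, moving_mean, moving_var)
    infer_out = bn_inference(probe, moving_mean, moving_var)
    return float(np.abs(train_out - infer_out).mean())


rng = np.random.default_rng(0)
gap_10 = train_inference_gap(10, rng)
gap_360 = train_inference_gap(360, rng)  # 72 steps x 5 epochs, as in the log above

print("gap after 10 steps: ", gap_10)
print("gap after 360 steps:", gap_360)
```

In this toy setting the gap between the two modes is large after a handful of updates and shrinks as the moving averages catch up. A direct test on the real model would be to call net(x, training=True) and net(x, training=False) on the same batch and compare the outputs; I have not done that yet.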