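As you mentioned, one solution is to one-hot encode the categorical data (or even use them as they are, in index-based format) and feed them to an LSTM layer along with the numerical data. Of course, you could also have two LSTM layers here, one for processing the numerical data and another for processing the categorical data (in one-hot encoded or index-based format), and then merge their outputs.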
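
To make the one-hot option concrete, here is a minimal NumPy sketch of the preprocessing step only. The helper name `one_hot_sequence`, the toy sizes, and the random data are assumptions for illustration; they are not part of the question.

```python
import numpy as np

def one_hot_sequence(cat_idx, n_categories):
    """Turn integer categories of shape (n_samples, n_steps) into
    one-hot vectors of shape (n_samples, n_steps, n_categories)."""
    return np.eye(n_categories, dtype=np.float32)[cat_idx]

# Hypothetical toy data: 4 samples, 100 timesteps, 10 numerical features,
# plus one categorical feature with 5 possible values (0..4).
n_samples, n_steps, n_numerical_feats, n_categories = 4, 100, 10, 5
rng = np.random.default_rng(0)
numerical = rng.normal(size=(n_samples, n_steps, n_numerical_feats)).astype(np.float32)
cat = rng.integers(0, n_categories, size=(n_samples, n_steps))

cat_one_hot = one_hot_sequence(cat, n_categories)

# Concatenate along the feature axis; the result can be fed to a single LSTM layer.
lstm_input = np.concatenate([numerical, cat_one_hot], axis=-1)
print(lstm_input.shape)
```

This only scales to features with a modest number of categories, since the feature axis grows by one column per category.
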
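Another solution is to have a separate embedding layer for each categorical feature. Each embedding layer may have its own embedding dimension (and, as suggested above, you may have more than one LSTM layer to process the numerical and categorical features separately):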

    from keras.layers import Input, Embedding, LSTM, TimeDistributed, Reshape, concatenate
    from keras.models import Model

    num_cats = 3                   # number of categorical features
    n_steps = 100                  # number of timesteps in each sample
    n_numerical_feats = 10         # number of numerical features in each sample
    cat_size = [1000, 500, 100]    # number of categories in each categorical feature
    cat_embd_dim = [50, 10, 100]   # embedding dimension for each categorical feature

    numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')

    # one input layer per categorical feature, holding one index per timestep
    cat_inputs = []
    for i in range(num_cats):
        cat_inputs.append(Input(shape=(n_steps, 1), name='cat' + str(i+1) + '_input'))

    # one embedding layer per categorical feature, applied at every timestep
    cat_embedded = []
    for i in range(num_cats):
        embed = TimeDistributed(Embedding(cat_size[i], cat_embd_dim[i]))(cat_inputs[i])
        cat_embedded.append(embed)

    cat_merged = concatenate(cat_embedded)
    # drop the singleton axis: (n_steps, 1, 160) -> (n_steps, 160)
    cat_merged = Reshape((n_steps, -1))(cat_merged)
    merged = concatenate([numerical_input, cat_merged])
    lstm_out = LSTM(64)(merged)

    model = Model([numerical_input] + cat_inputs, lstm_out)
    model.summary()

Here is the model summary:

    Layer (type)                    Output Shape         Param #     Connected to
    ==================================================================================================
    cat1_input (InputLayer)         (None, 100, 1)       0
    __________________________________________________________________________________________________
    cat2_input (InputLayer)         (None, 100, 1)       0
    __________________________________________________________________________________________________
    cat3_input (InputLayer)         (None, 100, 1)       0
    __________________________________________________________________________________________________
    time_distributed_1 (TimeDistrib (None, 100, 1, 50)   50000       cat1_input[0][0]
    __________________________________________________________________________________________________
    time_distributed_2 (TimeDistrib (None, 100, 1, 10)   5000        cat2_input[0][0]
    __________________________________________________________________________________________________
    time_distributed_3 (TimeDistrib (None, 100, 1, 100)  10000       cat3_input[0][0]
    __________________________________________________________________________________________________
    concatenate_1 (Concatenate)     (None, 100, 1, 160)  0           time_distributed_1[0][0]
                                                                     time_distributed_2[0][0]
                                                                     time_distributed_3[0][0]
    __________________________________________________________________________________________________
    numeric_input (InputLayer)      (None, 100, 10)      0
    __________________________________________________________________________________________________
    reshape_1 (Reshape)             (None, 100, 160)     0           concatenate_1[0][0]
    __________________________________________________________________________________________________
    concatenate_2 (Concatenate)     (None, 100, 170)     0           numeric_input[0][0]
                                                                     reshape_1[0][0]
    __________________________________________________________________________________________________
    lstm_1 (LSTM)                   (None, 64)           60160       concatenate_2[0][0]
    ==================================================================================================
    Total params: 125,160
    Trainable params: 125,160
    Non-trainable params: 0
    __________________________________________________________________________________________________

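
As a sanity check on the summary, the parameter counts can be reproduced by hand. The short sketch below uses the standard formulas: an embedding layer stores one vector per category, and an LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias. The variable `lstm_units` is just a name introduced here for the value 64 used in the model.

```python
cat_size = [1000, 500, 100]    # number of categories in each categorical feature
cat_embd_dim = [50, 10, 100]   # embedding dimension for each categorical feature
n_numerical_feats = 10
lstm_units = 64

# Embedding: one row of length `dim` per category.
embedding_params = [size * dim for size, dim in zip(cat_size, cat_embd_dim)]

# The LSTM sees the numerical features plus all embeddings at every timestep.
lstm_input_dim = n_numerical_feats + sum(cat_embd_dim)

# LSTM: 4 gates, each with (input + recurrent) weights and a bias vector.
lstm_params = 4 * ((lstm_input_dim + lstm_units) * lstm_units + lstm_units)

total_params = sum(embedding_params) + lstm_params
print(embedding_params, lstm_input_dim, lstm_params, total_params)
# prints: [50000, 5000, 10000] 170 60160 125160
```
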
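There is yet another solution you can try: use a single embedding layer for all the categorical features. It involves some preprocessing, though: you need to re-index all the categories so that they are distinct from each other. For example, the categories in the first categorical feature would be numbered from 1 to `size_first_cat`, the categories in the second categorical feature from `size_first_cat + 1` to `size_first_cat + size_second_cat`, and so on. However, in this solution all the categorical features have the same embedding dimension, since only one embedding layer is used.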
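
The re-indexing step could look roughly like the following NumPy sketch. The function name `reindex_categories` is hypothetical, and it assumes the raw indices of each feature start at 0; index 0 of the shared vocabulary is left unused (for example, for padding) so that the first feature occupies 1 to `size_first_cat` as described above.

```python
import numpy as np

def reindex_categories(cat_arrays, cat_sizes):
    """Shift each feature's 0-based indices into its own disjoint range.

    Returns the shifted arrays and the vocabulary size to pass as the
    first argument of the single shared Embedding layer.
    """
    # Offset of each feature = total size of all the features before it.
    offsets = np.concatenate(([0], np.cumsum(cat_sizes)[:-1]))
    shifted = [arr + off + 1 for arr, off in zip(cat_arrays, offsets)]
    vocab_size = int(np.sum(cat_sizes)) + 1   # +1 for the unused index 0
    return shifted, vocab_size

# Toy example: two features with 3 and 2 categories respectively.
cat1 = np.array([[0, 2, 1]])
cat2 = np.array([[1, 0, 1]])
(shifted1, shifted2), vocab_size = reindex_categories([cat1, cat2], [3, 2])
print(shifted1.tolist(), shifted2.tolist(), vocab_size)
# prints: [[1, 3, 2]] [[5, 4, 5]] 6
```
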
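Update: Now that I think about it, you could also reshape the categorical features in the data preprocessing stage, or even inside the model, to get rid of the `TimeDistributed` and `Reshape` layers (this may also speed up training):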

    numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')

    # each categorical input is now 2D per batch: (n_steps,) instead of (n_steps, 1)
    cat_inputs = []
    for i in range(num_cats):
        cat_inputs.append(Input(shape=(n_steps,), name='cat' + str(i+1) + '_input'))

    # Embedding maps (n_steps,) directly to (n_steps, embedding_dim)
    cat_embedded = []
    for i in range(num_cats):
        embed = Embedding(cat_size[i], cat_embd_dim[i])(cat_inputs[i])
        cat_embedded.append(embed)

    cat_merged = concatenate(cat_embedded)
    merged = concatenate([numerical_input, cat_merged])
    lstm_out = LSTM(64)(merged)

    model = Model([numerical_input] + cat_inputs, lstm_out)

As for fitting the model, you need to feed each input layer its own corresponding NumPy array, for example:

    X_tr_numerical = X_train[:,:,:n_numerical_feats]

    # extract categorical features: you can use a for loop to do this as well.
    # cat1_idx, cat2_idx and cat3_idx are the column indices of the categorical features in X_train.
    # note that we reshape categorical features to make them consistent with the updated solution
    X_tr_cat1 = X_train[:,:,cat1_idx].reshape(-1, n_steps)
    X_tr_cat2 = X_train[:,:,cat2_idx].reshape(-1, n_steps)
    X_tr_cat3 = X_train[:,:,cat3_idx].reshape(-1, n_steps)

    # don't forget to compile the model ...

    # fit the model
    model.fit([X_tr_numerical, X_tr_cat1, X_tr_cat2, X_tr_cat3], y_train, ...)

    # or you can use input layer names instead
    model.fit({'numeric_input': X_tr_numerical,
               'cat1_input': X_tr_cat1,
               'cat2_input': X_tr_cat2,
               'cat3_input': X_tr_cat3}, y_train, ...)

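
The for-loop variant mentioned in the comments above might look like the sketch below. It assumes a layout that the question does not specify: a single array `X_train` of shape `(n_samples, n_steps, n_features)` in which the categorical columns come right after the numerical ones. The toy data is random and only there to make the snippet runnable.

```python
import numpy as np

n_samples, n_steps, n_numerical_feats, num_cats = 8, 100, 10, 3
cat_size = [1000, 500, 100]

# Hypothetical toy data: numerical columns first, categorical columns last.
rng = np.random.default_rng(0)
X_num = rng.normal(size=(n_samples, n_steps, n_numerical_feats))
X_cat = np.stack([rng.integers(0, size, size=(n_samples, n_steps))
                  for size in cat_size], axis=-1)
X_train = np.concatenate([X_num, X_cat], axis=-1)

# Build the dict of named inputs expected by the model's input layers.
inputs = {'numeric_input': X_train[:, :, :n_numerical_feats].astype('float32')}
for i in range(num_cats):
    col = n_numerical_feats + i                      # assumed column position
    inputs['cat%d_input' % (i + 1)] = X_train[:, :, col].astype('int32')

for name, arr in inputs.items():
    print(name, arr.shape, arr.dtype)
```

The resulting `inputs` dict can be passed as the first argument of `model.fit`, in place of the dict literal written out by hand.
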
If you would like to use `fit_generator()`, nothing changes:

    # if you are using a generator
    def my_generator(...):

        # prep the data ...

        yield [batch_tr_numerical, batch_tr_cat1, batch_tr_cat2, batch_tr_cat3], batch_tr_y

        # or use the names
        yield {'numeric_input': batch_tr_numerical,
               'cat1_input': batch_tr_cat1,
               'cat2_input': batch_tr_cat2,
               'cat3_input': batch_tr_cat3}, batch_tr_y

    model.fit_generator(my_generator(...), ...)

    # or if you are subclassing Sequence class
    class MySequence(Sequence):
        def __init__(self, x_set, y_set, batch_size):
            # initialize the data

        def __len__(self):
            # Sequence also requires this method: return the number of batches per epoch

        def __getitem__(self, idx):
            # fetch data for the given batch index (i.e. idx)
            # same as the generator above but use `return` instead of `yield`

    model.fit_generator(MySequence(...), ...)

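
For reference, a filled-in version of the generator skeleton could look like this. It is only a sketch: the function name `batch_generator`, its parameters, and the zero-filled toy arrays are assumptions, and it uses plain NumPy so it can be run without a model.

```python
import numpy as np

def batch_generator(inputs, y, batch_size, shuffle=True, seed=0):
    """Yield (inputs_dict, targets) batches forever, reshuffling on every pass."""
    rng = np.random.default_rng(seed)
    n_samples = len(y)
    while True:
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # Slice every named input with the same sample indices.
            yield {name: arr[idx] for name, arr in inputs.items()}, y[idx]

# Hypothetical toy data keyed by the input layer names used in the model.
n_samples, n_steps = 10, 100
demo_inputs = {
    'numeric_input': np.zeros((n_samples, n_steps, 10), dtype='float32'),
    'cat1_input': np.zeros((n_samples, n_steps), dtype='int32'),
    'cat2_input': np.zeros((n_samples, n_steps), dtype='int32'),
    'cat3_input': np.zeros((n_samples, n_steps), dtype='int32'),
}
demo_y = np.arange(n_samples)

gen = batch_generator(demo_inputs, demo_y, batch_size=4, shuffle=False)
x_batch, y_batch = next(gen)
print(x_batch['numeric_input'].shape, y_batch.tolist())
# prints: (4, 100, 10) [0, 1, 2, 3]
```

Because this generator never stops, the training call needs to be told how many batches make up one epoch (`steps_per_epoch`, here 3 batches for 10 samples). In recent TensorFlow releases, `model.fit` accepts generators and `Sequence` objects directly.
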