I have saved a classifier pipeline using joblib:
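import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train, y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)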
Now I am trying to use it in a production environment:
def classify(title):
    # load classifier and predict
    classifier = joblib.load('class.pkl')
    # vectorize/transform the new title then predict
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
    X_test = vectorizer.transform(title)
    predict = classifier.predict(X_test)
    return predict
The error I get is: ValueError: Vocabulary wasn't fitted or is empty!
I think I should load the vocabulary from joblib, but I can't get it to work.
Just replace:
    # load classifier and predict
    classifier = joblib.load('class.pkl')
    # vectorize/transform the new title then predict
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
    X_test = vectorizer.transform(title)
    predict = classifier.predict(X_test)
    return predict
with:
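    # Load the saved pipeline, which contains both the fitted vectorizer
    # and the classifier, then predict directly on the raw text.
    classifier = joblib.load('class.pkl')
    # title must be an iterable of strings; use [title] for a single string
    predict = classifier.predict(title)
    return predict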
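class.pkl contains the full pipeline, so there is no need to create a new vectorizer instance. As the error message says, you need to reuse the vectorizer that was fitted during training, because the feature mapping from tokens (string n-grams) to column indices is stored in the vectorizer itself. This mapping is called the "vocabulary".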
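The round trip described above can be sketched end to end as follows. This is a minimal illustration, not the original poster's setup: the training titles, the labels, the temporary file location, and `random_state=0` are all assumptions made for the example. It shows that a freshly constructed vectorizer has no vocabulary and refuses to transform, while the pipeline loaded from disk still carries the fitted vocabulary and can predict on raw text.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only for illustration.
train_titles = [
    "cheap flights to paris",
    "discount hotel deals in rome",
    "python pipeline tutorial",
    "debugging scikit-learn models",
]
train_labels = ["travel", "travel", "tech", "tech"]

# Same pipeline layout as in the question (random_state added for repeatability).
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))),
    ("pac", PassiveAggressiveClassifier(C=1, random_state=0)),
])
pipeline.fit(train_titles, train_labels)

# Persist the whole pipeline (vectorizer + classifier) in one file.
model_path = os.path.join(tempfile.mkdtemp(), "class.pkl")
joblib.dump(pipeline, model_path, compress=9)


def classify(title, path=model_path):
    """Load the saved pipeline and predict the label of a single title."""
    classifier = joblib.load(path)
    # The pipeline vectorizes internally; it expects an iterable of documents.
    return classifier.predict([title])[0]


# A brand-new vectorizer has never seen the training data, so it has no
# vocabulary and transform() raises (NotFittedError is a ValueError subclass).
fresh = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
try:
    fresh.transform(["cheap hotel deals"])
    fresh_failed = False
except ValueError:
    fresh_failed = True

# The loaded pipeline still holds the token -> column index mapping.
loaded = joblib.load(model_path)
vocabulary = loaded.named_steps["vectorizer"].vocabulary_
prediction = classify("cheap hotel deals")

print("fresh vectorizer failed:", fresh_failed)
print("vocabulary size:", len(vocabulary))
print("prediction:", prediction)
```

In a long-running service it is usually preferable to call `joblib.load` once at startup and reuse the loaded pipeline, rather than reloading the file on every call as `classify` does here for simplicity.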