I'm trying to find the similarity of two text columns ('title', 'headline') by applying cosine similarity to a PySpark dataframe in Databricks. My function is called 'cosine_sim_udf', and to be able to use it on the dataframe I first have to do a udf conversion.
After applying the function to the df I get a LookupError. Does anyone know the cause, or have a suggested solution?
My function for finding the cosine similarity:
import string
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

nltk.download('punkt')

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return float(((tfidf * tfidf.T).A)[0, 1])

cosine_sim_udf = udf(cosine_sim, FloatType())

# 'title' and 'headline' are the text columns to compare
df2 = df.withColumn('cosine_distance', cosine_sim_udf('title', 'headline'))
Then I get this error:
PythonException: 'LookupError:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 426.0 failed 4 times, most recent failure: Lost task 0.3 in stage 426.0 (TID 2135) (10.109.245.129 executor 1): org.apache.spark.api.python.PythonException: 'LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
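One workaround I am considering (not yet verified): the notebook-level nltk.download('punkt') only runs on the driver, so the punkt data never lands in any of the executor search paths listed above. NLTK also consults the NLTK_DATA environment variable when building that search path, so downloading once to a location every node can read might work. A sketch, assuming a Databricks cluster where /dbfs is the shared DBFS mount (the exact directory name below is my own choice):

```python
import os

# Hypothetical: a path visible to both driver and executors on Databricks.
NLTK_DATA_DIR = "/dbfs/nltk_data"

# NLTK reads NLTK_DATA when it builds its search path, so this has to be set
# before nltk is imported on each worker (e.g. via cluster env variables).
os.environ["NLTK_DATA"] = NLTK_DATA_DIR

# Then download once, on the driver:
#   import nltk
#   nltk.download("punkt", download_dir=NLTK_DATA_DIR)
```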