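Introduction
Using the Coursera course dataset, we vectorize the course titles and look up the course vectors most similar to the target course's title vector, which gives a content-based course recommender.
Implementation
Setting up the environment and data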
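To make the idea of "most similar title vector" concrete, here is a minimal sketch in plain Python. It is not the method used later in this post: it uses raw word counts rather than TF-IDF weights, and the `STOP_WORDS` set and helper names are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "in", "for", "of", "and", "on", "to"}

def bag_of_words(title):
    """Turn a title into a sparse count vector, ignoring stop words."""
    return Counter(w for w in title.lower().split() if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

target = bag_of_words("A Crash Course in Data Science")
close = cosine(target, bag_of_words("Tools for Data Science"))
far = cosine(target, bag_of_words("A Law Student's Toolkit"))
print(close, far)
```

Titles that share content words with the target ("data", "science") score higher than titles that share none, which is the property the recommender relies on.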
import numpy as np
import pandas as pd
from statistics import harmonic_mean
from langdetect import detect
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Output: /kaggle/input/coursera-course-dataset/coursea_data.csv
Displaying the raw data
df = pd.read_csv('/kaggle/input/coursera-course-dataset/coursea_data.csv')
df.drop(['Unnamed: 0', 'course_organization'], axis=1, inplace=True)
df
| | course_title | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled |
| --- | --- | --- | --- | --- | --- |
| 0 | (ISC)² Systems Security Certified Practitioner… | SPECIALIZATION | 4.7 | Beginner | 5.3k |
| 1 | A Crash Course in Causality: Inferring Causal… | COURSE | 4.7 | Intermediate | 17k |
| 2 | A Crash Course in Data Science | COURSE | 4.5 | Mixed | 130k |
| 3 | A Law Student’s Toolkit | COURSE | 4.7 | Mixed | 91k |
| 4 | A Life of Happiness and Fulfillment | COURSE | 4.8 | Mixed | 320k |
| … | … | … | … | … | … |
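Converting the enrollment count to a numeric type
# Keep only rows whose enrollment count ends in 'k'; rows with any other suffix are dropped.
df = df[df.course_students_enrolled.str.endswith('k')].copy()
# Strip the 'k' and scale by 1000; float() parses the number without resorting to eval().
df['course_students_enrolled'] = df['course_students_enrolled'].apply(lambda enrolled: float(enrolled[:-1]) * 1000)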
df
| | course_title | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled |
| --- | --- | --- | --- | --- | --- |
| 0 | (ISC)² Systems Security Certified Practitioner… | SPECIALIZATION | 4.7 | Beginner | 5300.0 |
| 1 | A Crash Course in Causality: Inferring Causal… | COURSE | 4.7 | Intermediate | 17000.0 |
| 2 | A Crash Course in Data Science | COURSE | 4.5 | Mixed | 130000.0 |
| 3 | A Law Student’s Toolkit | COURSE | 4.7 | Mixed | 91000.0 |
| 4 | A Life of Happiness and Fulfillment | COURSE | 4.8 | Mixed | 320000.0 |
| … | … | … | … | … | … |
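The conversion above keeps only counts that end in 'k'. If you would rather keep every row, a suffix-aware parser is one option. The sketch below is an assumption-laden alternative, not what this post uses: `parse_enrollment` is a hypothetical helper, and the 'm' (millions) suffix is assumed rather than confirmed from the data shown here.

```python
def parse_enrollment(text):
    """Convert strings such as '5.3k' or '17k' into a float count.

    The 'm' multiplier is an assumption about what the column may contain.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    text = text.strip().lower()
    suffix = text[-1:]  # '' for an empty string, so float('') raises ValueError
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)

for raw in ["5.3k", "17k", "130k", "3.2m"]:
    print(raw, "->", parse_enrollment(raw))
```

It could be applied with `df['course_students_enrolled'].apply(parse_enrollment)` in place of the filter-then-convert step. The normalized values shown later would then change, because min-max scaling depends on the largest count that survives.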
Normalizing the data
minmax_scaler = MinMaxScaler()
scaled_ratings = minmax_scaler.fit_transform(df[['course_rating','course_students_enrolled']])
df['course_rating'] = scaled_ratings[:,0]
df['course_students_enrolled'] = scaled_ratings[:,1]
df['overall_rating'] = df[['course_rating','course_students_enrolled']].apply(lambda row : harmonic_mean(row), axis=1)
df
| | course_title | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled | overall_rating |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | (ISC)² Systems Security Certified Practitioner… | SPECIALIZATION | 0.823529 | Beginner | 0.004587 | 0.009122 |
| 1 | A Crash Course in Causality: Inferring Causal… | COURSE | 0.823529 | Intermediate | 0.018709 | 0.036586 |
| 2 | A Crash Course in Data Science | COURSE | 0.705882 | Mixed | 0.155100 | 0.254319 |
| 3 | A Law Student’s Toolkit | COURSE | 0.823529 | Mixed | 0.108027 | 0.190999 |
| 4 | A Life of Happiness and Fulfillment | COURSE | 0.882353 | Mixed | 0.384430 | 0.535534 |
| … | … | … | … | … | … | … |
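The `overall_rating` column is the harmonic mean of the two normalized columns. As a quick check, the sketch below recomputes two rows from the table above using only the standard library; the variable names are illustrative.

```python
from statistics import harmonic_mean, mean

# (normalized course_rating, normalized course_students_enrolled) from the table above
balanced_row = (0.882353, 0.384430)   # row 4
lopsided_row = (0.823529, 0.004587)   # row 0

balanced = harmonic_mean(balanced_row)
lopsided = harmonic_mean(lopsided_row)
lopsided_arithmetic = mean(lopsided_row)

print(round(balanced, 6), round(lopsided, 6), round(lopsided_arithmetic, 6))

# harmonic_mean returns 0 as soon as one input is 0.
zero_case = harmonic_mean([0.0, 0.9])
```

The harmonic mean is dominated by the smaller value: row 0 has a high rating but very few students, so it scores about 0.009, whereas an arithmetic mean would give about 0.41. One side effect worth knowing is that min-max scaling maps the smallest value in each column to 0, so a course with the lowest rating or the lowest enrollment gets an `overall_rating` of 0.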
Generating recommendations
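# Keep English titles only. langdetect can return varying results on short text unless DetectorFactory.seed is set.
df = df[df.course_title.apply(lambda title: detect(title) == 'en')]
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df.course_title)

def recommend_by_course_title(title, recomm_count=10):
    # Project the query title into the TF-IDF space fitted on the catalogue.
    title_vector = vectorizer.transform([title])
    cosine_sim = cosine_similarity(vectors, title_vector)
    # Positions of the recomm_count highest similarities.
    idx = np.argsort(cosine_sim[:, 0])[-recomm_count:]
    # Order the shortlisted courses by overall_rating, best first.
    sdf = df.iloc[idx].sort_values(by='overall_rating', ascending=False)
    return sdf

recommend_by_course_title('A Crash Course in Data Science')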
| | course_title | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled | overall_rating |
| --- | --- | --- | --- | --- | --- | --- |
| 487 | Introduction to Data Science in Python | COURSE | 0.705882 | Intermediate | 0.468920 | 0.563503 |
| 486 | Introduction to Data Science | SPECIALIZATION | 0.764706 | Beginner | 0.372360 | 0.500843 |
| 864 | What is Data Science? | COURSE | 0.823529 | Beginner | 0.312010 | 0.452559 |
| 54 | Applied Data Science | SPECIALIZATION | 0.764706 | Beginner | 0.263730 | 0.392199 |
| 711 | SQL for Data Science | COURSE | 0.764706 | Beginner | 0.191310 | 0.306053 |
| 2 | A Crash Course in Data Science | COURSE | 0.705882 | Mixed | 0.155100 | 0.254319 |
| 825 | Tools for Data Science | COURSE | 0.764706 | Beginner | 0.143030 | 0.240986 |
| 171 | Crash Course on Python | COURSE | 0.882353 | Beginner | 0.095957 | 0.173089 |
| 1 | A Crash Course in Causality: Inferring Causal… | COURSE | 0.823529 | Intermediate | 0.018709 | 0.036586 |
| 594 | Mathematics for Data Science | SPECIALIZATION | 0.705882 | Beginner | 0.012674 | 0.024900 |
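Note that the query course itself (row 2) appears in its own recommendations, since its title has a cosine similarity of 1 with itself. One possible variation, sketched below on a hypothetical four-title catalogue, skips exact title matches and titles with zero similarity. `most_similar_titles` is an illustrative name, not part of the code above, and it ranks by similarity only rather than by `overall_rating`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; the titles are borrowed from the tables above.
titles = [
    "A Crash Course in Data Science",
    "Introduction to Data Science",
    "A Law Student's Toolkit",
    "Crash Course on Python",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(titles)

def most_similar_titles(query, top_n=2):
    """Return up to top_n titles ranked by similarity, skipping the query itself."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(vectors, query_vector)[:, 0]
    order = np.argsort(scores)[::-1]  # highest similarity first
    ranked = [titles[i] for i in order if titles[i] != query and scores[i] > 0]
    return ranked[:top_n]

result = most_similar_titles("A Crash Course in Data Science")
print(result)
```

In the full pipeline the same filter could be written as `sdf[sdf.course_title != title]` before returning, at the cost of returning one fewer row when the query title is in the dataset.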