Knowledge Tracing Resource Roundup 1

2023-05-16

An overview of common knowledge tracing datasets, code, blog posts, and more. I am just a diligent curator here, so read on.

Datasets

Knowledge Tracing Benchmark Dataset

Several datasets are suitable for this task:

KDD Cup 2010  https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp

ASSISTments ASSISTments (google.com)

 OLI Engineering Statics 2011  https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507

JunyiAcademy Math Practicing Log [Annotation]  

slepemapy.cz  https://www.fi.muni.cz/adaptivelearning/?a=data

synthetic  synthetic (github.com)

math2015  

EdNet

 pisa2015math

 workbankr

 critlangacq

The following datasets are provided by EduData ktbd:

Dataset Name | Description
synthetic | The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub
assistment_2009_2010 | The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub
junyi | Part of the preprocessed junyi dataset, which includes only the 1000 most active student interaction sequences

For details, see EduData/ktbd.md at master · bigdata-ustc/EduData (github.com)

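Data Format

In knowledge tracing tasks, a popular format (which we call the triple-line (tl) format) is used to represent interaction sequence records: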

5

419,419,419,665,665

1,1,1,0,0

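This format can be found in Deep Knowledge Tracing. In it, every three lines make up one interaction sequence. The first line gives the length of the interaction sequence, the second line lists the exercise IDs, and the third line lists the responses, where each element is 1 for a correct answer or 0 for a wrong answer.

To handle special symbols that are hard to store in the format above, we provide another format, named json sequence, to represent interaction sequence records:

[[419, 1], [419, 1], [419, 1], [665, 0], [665, 0]]

Each item in the sequence represents one interaction. The first element of an item is the exercise ID (in some works the exercise ID does not map one-to-one to a knowledge unit (ku)/concept, but in junyi each exercise contains exactly one ku). The second element indicates whether the learner answered the exercise correctly: 0 for wrong, 1 for correct. Each line holds one json record, which corresponds to one learner's interaction sequence.

We provide tools for converting between the two formats: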

# convert tl sequence to json sequence, by default, the exercise tag and answer will be converted into int type
edudata tl2json $src $tar
# convert tl sequence to json sequence without type conversion
edudata tl2json $src $tar False
# convert json sequence to tl sequence
edudata json2tl $src $tar
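
To make the relationship between the two formats concrete, here is a minimal pure-Python sketch of the conversion. It is not the EduData implementation: the function names `tl_to_json_lines` and `json_line_to_tl` are made up for illustration, and the sketch assumes integer exercise IDs and answers, as in the example above.

```python
import json


def tl_to_json_lines(tl_text):
    """Convert three-line (tl) records into json-sequence lines (hypothetical helper)."""
    # Drop blank lines so records separated by empty lines still parse.
    lines = [line.strip() for line in tl_text.splitlines() if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("tl input must contain a multiple of three non-empty lines")
    json_lines = []
    for i in range(0, len(lines), 3):
        length = int(lines[i])                                  # line 1: sequence length
        exercises = [int(x) for x in lines[i + 1].split(",")]   # line 2: exercise IDs
        answers = [int(x) for x in lines[i + 2].split(",")]     # line 3: 1 = correct, 0 = wrong
        if not (length == len(exercises) == len(answers)):
            raise ValueError(f"record {i // 3}: declared length does not match the data")
        json_lines.append(json.dumps([[e, a] for e, a in zip(exercises, answers)]))
    return json_lines


def json_line_to_tl(json_line):
    """Convert one json-sequence line back into a three-line record (hypothetical helper)."""
    seq = json.loads(json_line)
    exercises = ",".join(str(e) for e, _ in seq)
    answers = ",".join(str(a) for _, a in seq)
    return f"{len(seq)}\n{exercises}\n{answers}"


tl_record = "5\n419,419,419,665,665\n1,1,1,0,0"
json_lines = tl_to_json_lines(tl_record)
print(json_lines[0])
print(json_line_to_tl(json_lines[0]) == tl_record)
```

The round trip returns the original record, which is a quick sanity check when writing your own preprocessing.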

Dataset Preprocessing

https://github.com/ckyeungac/deepknowledgetracing/blob/master/notebooks/ProcessSkillBuilder0910.ipynb

EduData/ASSISTments2015.ipynb at master · bigdata-ustc/EduData (github.com)

ASSISTments2015 Data Analysis

Data Description

Column Description

Field | Annotation
user id | Id of the student
log id | Unique ID of the logged actions
sequence id | Id of the problem set
correct | Correct on the first attempt, incorrect on the first attempt, or asked for help

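# numpy/pandas for data handling, plotly for the charts used in the notebook
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# Load the ASSISTments 2015 skill-builder log, read with the ISO-8859-15 encoding
path = "2015_100_skill_builders_main_problems.csv"
data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)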

Record Examples

pd.set_option('display.max_columns', 500)
data.head()
   user_id     log_id  sequence_id  correct
0    50121  167478035         7014      0.0
1    50121  167478043         7014      1.0
2    50121  167478053         7014      1.0
3    50121  167478069         7014      1.0
4    50964  167478041         7014      1.0

General features

data.describe()
             user_id        log_id    sequence_id        correct
count  708631.000000  7.086310e+05  708631.000000  708631.000000
mean   296232.978276  1.695323e+08   22683.474821       0.725502
std     48018.650247  3.608096e+06   41593.028018       0.437467
min     50121.000000  1.509145e+08    5898.000000       0.000000
25%    279113.000000  1.660355e+08    7020.000000       0.000000
50%    299168.000000  1.704579e+08    9424.000000       1.000000
75%    335647.000000  1.723789e+08   14442.000000       1.000000
max    362374.000000  1.754827e+08  236309.000000       1.000000
print("The number of records: "+ str(len(data['log_id'].unique())))
The number of records: 708631

print('Part of missing values for every column')
print(data.isnull().sum() / len(data))
Part of missing values for every column
user_id        0.0
log_id         0.0
sequence_id    0.0
correct        0.0
dtype: float64
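
With the columns described above, turning the log into per-student interaction sequences is mostly a group-and-sort step. The sketch below uses a few toy rows copied from the record examples and only the standard library, so it runs without the CSV. It rests on assumptions that are not stated in the source: that log_id order reflects chronological order, that sequence_id is used as the exercise tag, and that only a value of exactly 1.0 counts as correct. The helper name `rows_to_tl` is made up for illustration.

```python
from collections import defaultdict

# Toy rows mirroring the record examples: (user_id, log_id, sequence_id, correct)
rows = [
    (50121, 167478035, 7014, 0.0),
    (50121, 167478043, 7014, 1.0),
    (50121, 167478053, 7014, 1.0),
    (50121, 167478069, 7014, 1.0),
    (50964, 167478041, 7014, 1.0),
]


def rows_to_tl(rows):
    """Group rows by student and emit one three-line (tl) record per student."""
    by_user = defaultdict(list)
    for user_id, log_id, sequence_id, correct in rows:
        by_user[user_id].append((log_id, sequence_id, correct))

    records = []
    for user_id in sorted(by_user):
        # Assumption: sorting by log_id approximates chronological order.
        interactions = sorted(by_user[user_id])
        exercises = [str(seq_id) for _, seq_id, _ in interactions]
        # Assumption: binarize the label, treating only 1.0 as correct.
        answers = [str(int(correct == 1.0)) for _, _, correct in interactions]
        records.append(
            "\n".join([str(len(interactions)), ",".join(exercises), ",".join(answers)])
        )
    return records


tl_records = rows_to_tl(rows)
print("\n".join(tl_records))
```

On the real file, the same grouping could be done with a pandas groupby on user_id; the linked notebooks show the authors' actual preprocessing.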

Collected implementations:

https://github.com/seewoo5/KT

DKT (Deep Knowledge Tracing)

  • Paper: https://web.stanford.edu/~cpiech/bio/papers/deepKnowledgeTracing.pdf
  • Model: RNN, LSTM (only LSTM is implemented)
  • GitHub: https://github.com/chrispiech/DeepKnowledgeTracing (Lua)
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 77.02 ± 0.07 | 81.81 ± 0.10 | input_dim=100, hidden_dim=100
ASSISTments2015 | 74.94 ± 0.04 | 72.94 ± 0.05 | input_dim=100, hidden_dim=100
ASSISTmentsChall | 68.67 ± 0.09 | 72.29 ± 0.06 | input_dim=100, hidden_dim=100
STATICS | 81.27 ± 0.06 | 82.87 ± 0.10 | input_dim=100, hidden_dim=100
Junyi Academy | 85.4 | 80.58 | input_dim=100, hidden_dim=100
EdNet-KT1 | 72.72 | 76.99 | input_dim=100, hidden_dim=100
  • All models are trained with batch size 2048 and sequence size 200.

DKVMN (Dynamic Key-Value Memory Network)

  • Paper: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p765.pdf
  • Model: Extension of Memory-Augmented Neural Network (MANN)
  • Github: https://github.com/jennyzhang0215/DKVMN (MxNet)
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 75.61 ± 0.21 | 79.56 ± 0.29 | key_dim = 50, value_dim = 200, summary_dim = 50, concept_num = 20, batch_size = 1024
ASSISTments2015 | 74.71 ± 0.02 | 71.57 ± 0.08 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048
ASSISTmentsChall | 67.16 ± 0.05 | 67.38 ± 0.07 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048
STATICS | 80.66 ± 0.09 | 81.16 ± 0.08 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 1024
Junyi Academy | 85.04 | 79.68 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 512
EdNet-KT1 | 72.32 | 76.48 | key_dim = 100, value_dim = 100, summary_dim = 100, concept_num = 100, batch_size = 256
  • Due to memory issues, not all models are trained with batch size 2048.

NPA (Neural Pedagogical Agent)

  • Paper: https://arxiv.org/abs/1906.10910
  • Model: Bi-LSTM + Attention
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 77.11 ± 0.08 | 81.82 ± 0.13 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
ASSISTments2015 | 75.02 ± 0.05 | 72.94 ± 0.08 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
ASSISTmentsChall | 69.34 ± 0.03 | 73.26 ± 0.03 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
STATICS | 81.38 ± 0.14 | 83.1 ± 0.25 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
Junyi Academy | 85.57 | 81.10 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
EdNet-KT1 | 73.05 | 77.58 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
  • All models are trained with batch size 2048 and sequence size 200.

SAKT (Self-Attentive Knowledge Tracing)

  • Paper: https://files.eric.ed.gov/fulltext/ED599186.pdf
  • Model: Transformer (1-layer, only encoder with subsequent mask)
  • Github: https://github.com/shalini1194/SAKT (Tensorflow)
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 76.36 ± 0.15 | 80.78 ± 0.10 | hidden_dim=100, seq_size=100, batch_size=512
ASSISTments2015 | 74.57 ± 0.07 | 71.49 ± 0.03 | hidden_dim=100, seq_size=50, batch_size=512
ASSISTmentsChall | 67.53 ± 0.06 | 69.70 ± 0.32 | hidden_dim=100, seq_size=200, batch_size=512
STATICS | 80.45 ± 0.13 | 80.30 ± 0.31 | hidden_dim=100, seq_size=500, batch_size=128
Junyi Academy | 85.27 | 80.36 | hidden_dim=100, seq_size=200, batch_size=512
EdNet-KT1 | 72.44 | 76.60 | hidden_dim=200, seq_size=200, batch_size=512

https://github.com/bigdata-ustc/TKT

Knowledge tracing models implemented with mxnet-gluon. For convenient downloading and preprocessing of knowledge tracing datasets, see EduData, which provides a handy API.

Visit https://base.ustc.edu.cn for more of our works.

Performance in well-known Dataset

Using EduData, we tested the models' performance. The AUC results are listed below:

model name | synthetic | assistment_2009_2010 | junyi
DKT | 0.6438748958881487 | 0.7442573465541942 | 0.8305416859735839
DKT+ | 0.8062221383790489 | 0.7483424087919035 | 0.8497422607539136
EmbedDKT | 0.4858168704660636 | 0.7285572301977586 | 0.8194401881889697
EmbedDKT+ | 0.7340996181876187 | 0.7490900876356051 | 0.8405445812109871
DKVMN | TBA | TBA | TBA

The f1 scores are listed as follows:

model name | synthetic | assistment_2009_2010 | junyi
DKT | 0.5813237474584396 | 0.7134380508024369 | 0.7732850122818582
DKT+ | 0.7041804463370387 | 0.7137627713343819 | 0.7928075377114897
EmbedDKT | 0.4716821311199386 | 0.7095025134079656 | 0.7681817174082963
EmbedDKT+ | 0.6316953625658291 | 0.7101790604990228 | 0.7903592922756097
DKVMN | TBA | TBA | TBA

The information of the benchmark datasets can be found in EduData docs.

In addition, all models are trained for 20 epochs with batch_size=16, and the best result is reported. We use Adam with learning_rate=1e-3. We also apply bucketing to speed up training, and each sample is limited to a length of 200. The hyper-parameters are listed as follows:

model name | synthetic - 50 | assistment_2009_2010 - 124 | junyi - 835
DKT | hidden_num=int(100);dropout=float(0.5) | hidden_num=int(200);dropout=float(0.5) | hidden_num=int(900);dropout=float(0.5)
DKT+ | lr=float(0.2);lw1=float(0.001);lw2=float(10.0) | lr=float(0.1);lw1=float(0.003);lw2=float(3.0) | lr=float(0.01);lw1=float(0.001);lw2=float(1.0)
EmbedDKT | hidden_num=int(100);latent_dim=int(35);dropout=float(0.5) | hidden_num=int(200);latent_dim=int(75);dropout=float(0.5) | hidden_num=int(900);latent_dim=int(600);dropout=float(0.5)
EmbedDKT+ | lr=float(0.2);lw1=float(0.001);lw2=float(10.0) | lr=float(0.1);lw1=float(0.003);lw2=float(3.0) | lr=float(0.01);lw1=float(0.001);lw2=float(1.0)
DKVMN | hidden_num=int(50);key_embedding_dim=int(10);value_embedding_dim=int(10);key_memory_size=int(5);key_memory_state_dim=int(10);value_memory_size=int(5);value_memory_state_dim=int(10);dropout=float(0.5) | hidden_num=int(50);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(50);key_memory_state_dim=int(50);value_memory_size=int(50);value_memory_state_dim=int(200);dropout=float(0.5) | hidden_num=int(600);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(20);key_memory_state_dim=int(50);value_memory_size=int(20);value_memory_state_dim=int(200);dropout=float(0.5)

The number after the - in the header row is the number of knowledge units (ku_num) in that dataset. The datasets we used can either be found in basedata-ktbd or be downloaded with:

pip install EduData
edudata download ktbd

Tricks

  • DKT: hidden_num is usually set to the hundred nearest to ku_num
  • EmbedDKT: latent_dim is usually set to a value less than or equal to sqrt(hidden_num * ku_num)
  • DKVMN: key_embedding_dim = key_memory_state_dim and value_embedding_dim = value_memory_state_dim
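
As a quick check of the EmbedDKT rule of thumb, the sketch below computes sqrt(hidden_num * ku_num) for the three benchmark settings and compares it with the latent_dim values reported in the hyper-parameter table. The numbers are copied from that table; the helper name `latent_dim_bound` is made up for illustration and is not part of TKT.

```python
import math

# (ku_num, hidden_num, latent_dim) as reported for EmbedDKT on each dataset
settings = {
    "synthetic": (50, 100, 35),
    "assistment_2009_2010": (124, 200, 75),
    "junyi": (835, 900, 600),
}


def latent_dim_bound(hidden_num, ku_num):
    """Upper bound suggested by the rule of thumb: sqrt(hidden_num * ku_num)."""
    return math.sqrt(hidden_num * ku_num)


for name, (ku_num, hidden_num, latent_dim) in settings.items():
    bound = latent_dim_bound(hidden_num, ku_num)
    # Report whether the published latent_dim respects the bound.
    print(f"{name}: latent_dim={latent_dim}, bound={bound:.1f}, within_bound={latent_dim <= bound}")
```

All three reported settings stay below the bound, so the rule is at least consistent with the published configurations.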

Notice

Some PyTorch interfaces may change between versions, for example:

import torch
torch.nn.functional.one_hot