Hi, I'm Ryan Baker and this is Big Data in Education. Today we're going to discuss the deep knowledge tracing family of algorithms. So I'll start this off by noting that some of the slides in this lecture were developed in collaboration with Richard Scruggs.
Our story starts with DKT, deep knowledge tracing, by Chris Piech and his colleagues. Deep knowledge tracing is based on long short-term memory networks, and it is fit on sequences of student performance across skills, predicting performance on future items within the system. Like all neural networks of its sort, it can fit very complex functions, very complex relationships between items over time.
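To make that concrete, here is a minimal sketch of a DKT-style model in PyTorch, assuming the usual setup where each step's input is a one-hot encoding of which exercise tag was attempted and whether it was correct. The layer sizes and class name are illustrative, not the exact published architecture.

```python
import torch
import torch.nn as nn

class DKTSketch(nn.Module):
    """Minimal DKT-style model: an LSTM over one-hot (exercise tag, correctness) inputs,
    outputting a predicted probability of correctness for every tag at every time step."""
    def __init__(self, n_tags, hidden_dim=100):
        super().__init__()
        # each time step's input: tag crossed with correctness, so 2 * n_tags one-hot features
        self.lstm = nn.LSTM(input_size=2 * n_tags, hidden_size=hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_tags)

    def forward(self, x):
        h, _ = self.lstm(x)                  # (batch, time, hidden_dim)
        return torch.sigmoid(self.out(h))    # (batch, time, n_tags)

# Toy usage: 4 students, 10 time steps, 12 exercise tags
n_tags = 12
x = torch.zeros(4, 10, 2 * n_tags)           # would hold the one-hot encoded interaction history
p_correct = DKTSketch(n_tags)(x)              # predicted correctness probabilities
```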
So our story begins with a little bit of a drama, because the initial paper, and actually an initial press release published even before the paper was submitted, reported massively better performance than original BKT or PFA. And in fact the difference was so big that it seemed at first too good to be true, and it was. Xiaolu Xiong and his colleagues reported that Piech et al. had used the same data points for both training and test, which, as you should know by this point in the course, is a massive sin. Following on that work, Khajah and his colleagues compared DKT to modern extensions to BKT on the same data set and found that it was particularly beneficial to refit the item-skill mappings. Kevin Wilson and his colleagues compared DKT to temporal IRT on the same data set. The bottom line was that all three approaches appeared to perform comparably well.
At the time, curmudgeons like me said, ah, this DKT stuff, not gonna do anything. But it turned out to be the beginning of what could be called the DKT family of algorithms, a range of knowledge tracing algorithms based on different variants of deep learning. And at this point, there are now literally hundreds of published variants, most of them tiny little tweaks to get tiny little gains in performance, but in aggregate across the hundreds of publications, there appear to be some real improvements to predictive performance. There was a comparison between various DKT family variants and more classical algorithms, for example, in Gervet et al. 2020, that showed pretty conclusively that the DKT family really is getting better performance.
In the next slides, I'm going to discuss some of the key issues that researchers have tried to address in these improvements to DKT and what their approaches were. One of the first problems noted for DKT came from work by Yeung and Yeung, who reported degenerate behavior for DKT. We talked a little about degeneracy for BKT and PFA, but this degeneracy was even worse.
People getting answers right and the knowledge estimates dropping. Wild swings in probability estimates in short periods of time: a student gets an item right and they go from a 30 percent probability of knowing the skill to 95 percent; they get one wrong and they shoot back down to 20 percent. Now, these issues really raise questions about whether DKT could actually be used in the real world, because we can't have these kinds of behaviors in a real system being used by students. But Yeung and Yeung proposed adding two types of regularization to modulate these swings: increasing the weight of the current prediction in future predictions, and reducing the amount the model is allowed to change future estimates.
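As a rough illustration of what those two regularization terms look like in a loss function, here is a sketch in the spirit of DKT+. The lambda values are placeholders rather than the published settings, and the tensor shapes and argument names are assumptions for this sketch.

```python
import torch.nn.functional as F

def dkt_plus_style_loss(pred_next, target_next, pred_curr, target_curr, pred_all,
                        lambda_r=0.1, lambda_w1=0.03, lambda_w2=3.0):
    """Sketch of a DKT+-style loss: the usual next-step prediction loss, plus a
    reconstruction term (the model should also predict the response it just observed)
    and waviness penalties on how much the prediction vector changes between steps."""
    # pred_next/target_next: predicted probability and float label for the skill answered next
    # pred_curr/target_curr: predicted probability and float label for the skill just answered
    # pred_all: (batch, time, n_skills) full prediction vectors, used for the waviness terms
    loss = F.binary_cross_entropy(pred_next, target_next)
    loss = loss + lambda_r * F.binary_cross_entropy(pred_curr, target_curr)
    diff = pred_all[:, 1:, :] - pred_all[:, :-1, :]
    loss = loss + lambda_w1 * diff.abs().mean() + lambda_w2 * (diff ** 2).mean()
    return loss
```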
And those regularization steps actually made a big difference, making DKT+, their first extension of DKT, a much more realistic algorithm for real-world use. It's worth noting, however, that DKT is impossible to interpret in terms of skills, because DKT predicts individual item correctness, not skills. So what do you do for an entirely new item that you don't have any data on?
Because it's not part of a skill, it's impossible to make a prediction for it. And, furthermore, what information can you provide teachers? It's not very useful to tell teachers...
here are the 17 items that I predict the student will get right, and here are the 3 or 4 items that I predict they'll get wrong. To address this, Zhang et al. proposed an extension to DKT called DKVMN that fits an item-skill mapping too. It's based on memory-augmented neural networks, and it keeps an external memory matrix that the neurons update and refer back to, which effectively maps to skills. In this approach, latent skills are discovered by the algorithm, and consequently, they're difficult to interpret.
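To give a feel for what that external memory looks like, here is a heavily simplified sketch of a DKVMN-style read and write, assuming a static key matrix (one row per latent skill) and a dynamic value matrix holding the student's per-skill state. The dimensions and layer names are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_latent, key_dim, value_dim = 20, 50, 100       # illustrative sizes
key_memory = torch.randn(n_latent, key_dim)       # static keys, one per latent skill
value_memory = torch.randn(n_latent, value_dim)   # dynamic per-skill student state
erase_layer = nn.Linear(value_dim, value_dim)     # produces how much to erase
add_layer = nn.Linear(value_dim, value_dim)       # produces what to add

def read(item_key):
    """Attend over memory slots by similarity between the item and each latent skill."""
    w = F.softmax(key_memory @ item_key, dim=0)    # correlation weights, (n_latent,)
    return w @ value_memory, w                     # read content and the weights

def write(value_memory, w, interaction_value):
    """Erase-then-add update of the value memory after observing a response."""
    erase = torch.sigmoid(erase_layer(interaction_value))      # (value_dim,)
    add = torch.tanh(add_layer(interaction_value))              # (value_dim,)
    value_memory = value_memory * (1 - w.unsqueeze(1) * erase)  # partial erase per slot
    return value_memory + w.unsqueeze(1) * add                  # partial add per slot

# Toy usage: one item embedding and one interaction embedding
content, w = read(torch.randn(key_dim))
value_memory = write(value_memory, w, torch.randn(value_dim))
```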
So this attempted to have latent knowledge estimation, but not in a way that was immediately applicable to teachers or to unseen items. In further work, Lee and Yeung proposed an alternative to DKT called KQN that attempts to output more interpretable latent skill estimates. It again fits an external memory network to fit skills, and it also attempts to fit the amount of information transfer between skills.
But still, in my view, not that interpretable. Yeung 2019, a different Yeung, although actually at the same university, proposed an alternative to DKT called Deep-IRT that attempted to output more interpretable latent skill estimates. Again, it fits an external memory network to fit the skills, and now fits a separate network to estimate student ability and item difficulty.
It then uses that estimated ability and difficulty to predict correctness with an item response theory model. And it's somewhat more interpretable than the previous approaches, or at least the IRT half of it is.
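That final prediction step is a simple item response theory calculation. Here is a tiny sketch; the 3.0 scaling on ability follows the Deep-IRT formulation as I understand it, but treat the exact constant and the function name as assumptions.

```python
import torch

def deep_irt_style_predict(student_ability, item_difficulty, ability_scale=3.0):
    """Combine the network's estimated ability and difficulty with a Rasch-style model."""
    return torch.sigmoid(ability_scale * student_ability - item_difficulty)

# Toy usage: a student slightly above average ability on an item of average difficulty
p = deep_irt_style_predict(torch.tensor(0.2), torch.tensor(0.0))   # roughly 0.65
```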
Now, one caveat for skill estimation. Some deep learning based algorithms attempt to estimate skill level, but their skill estimates are rarely if ever compared to post-tests or other estimates of skill level independent from performance in the system itself. And that's in part because most large data sets don't have that data available. Therefore, even though they're making these skill estimates, we don't actually really know if the estimates are any good. To address this, Scruggs et al. proposed AOA, an extension that can be applied to any knowledge tracing algorithm, although it was originally designed for DKT family algorithms. In AOA, a human-derived skill-item mapping is used.
The predicted performance on all items in the skill is averaged, including both unseen and already-seen items, in order to derive a latent estimate for future problems in that skill. Now, this is super simplistic, but it actually works: it led to successful prediction of post-tests outside the learning system in two different papers, better than BKT or Elo did. So BKT and Elo were seen by many, including me, as being superior to the DKT family in terms of their ability to predict an external latent, but when you map DKT's predictions back to a human-interpretable external latent this way, it does better than those algorithms.
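Here is a minimal sketch of that averaging step, with invented variable names: the item-level predictions could come from any KT model, and the skill-item mapping is the human-derived one.

```python
import numpy as np

def aoa_skill_estimate(pred_item_correctness, skill_to_items, skill):
    """AOA-style latent estimate: average predicted correctness over every item
    mapped to the skill, whether or not the student has seen it yet."""
    return float(np.mean([pred_item_correctness[item] for item in skill_to_items[skill]]))

# Toy usage with made-up predictions for one skill's items
preds = {"item_1": 0.82, "item_2": 0.64, "item_3": 0.71}
mapping = {"fractions": ["item_1", "item_2", "item_3"]}
print(aoa_skill_estimate(preds, mapping, "fractions"))  # about 0.72
```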
Another area of concern that people have raised is: what is DKT really learning? Ding and Larson demonstrated theoretically that a lot of what DKT learns is actually just how good a student is overall. Zhang et al. followed this up with empirical work showing that most of the improvement in performance for DKVMN is actually on the first attempt on a new skill. So that really corresponds to Ding and Larson's finding that it's mostly about how good the student is. More broadly, that benefit, as you can see on this graph here, dissipates mostly by the second practice attempt on a skill.
In particular, there's essentially no benefit to deep learning after several attempts on the skill, which is about the point where students often reach mastery, if they didn't already know the skill. So a lot of what DKT is doing is not helping us assess whether a student has reached mastery during their practice in an online learning system, but rather telling us which skills they probably already knew to begin with. Now, beyond these issues that people have tried to address, people have also just tried to make DKT better. One of the important variants here is called SAKT, by Pandey and Karypis, who proposed a DKT variant that fits attentional weights between exercises and more explicitly predicts performance on the current exercise from performance on related past exercises.
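To illustrate the core idea, here is a small sketch of the self-attention step that SAKT-style models build on: the exercise being predicted attends over embeddings of past interactions, with a causal mask so the model cannot peek ahead. The dimensions are arbitrary and this is not the full published architecture.

```python
import torch
import torch.nn as nn

embed_dim, n_heads, seq_len = 64, 4, 20
attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

# embeddings of past (exercise, response) interactions, and of the exercises to predict;
# the interaction sequence is assumed to already be shifted one step behind the exercises
past_interactions = torch.randn(1, seq_len, embed_dim)
current_exercises = torch.randn(1, seq_len, embed_dim)

# mask out future positions: True means "not allowed to attend"
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
context, weights = attn(current_exercises, past_interactions, past_interactions, attn_mask=mask)
# `context` would then feed a feed-forward layer and a sigmoid output to predict correctness;
# `weights` shows how much each past exercise contributed to each prediction
```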
This gets a little bit better fit, but it doubles down a little bit more on some limitations we've already discussed. Ghosh et al. also proposed a DKT variant called AKT, which explicitly stores and uses the learner's entire past practice history for each prediction, using an exponential decay curve to downweight past actions, and using a Rasch model embedding to calculate item difficulty. And, unshockingly, this does better, because it's now using additional information.
It's using the time sequence. Building on that general paradigm of adding in more information, SAINT+ by Shin et al. added elapsed time and lag time as additional inputs, leading to better performance.
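The general recipe for adding that kind of information is straightforward: project the continuous timing features into the model dimension and add them to the response embedding before it enters the network. Here is a rough sketch; the layer names and sizes are illustrative rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

d_model = 64
response_embed = nn.Embedding(2, d_model)   # correct / incorrect
elapsed_proj = nn.Linear(1, d_model)        # elapsed time: seconds taken to answer
lag_proj = nn.Linear(1, d_model)            # lag time: seconds since the previous interaction

responses = torch.tensor([1, 0, 1])                  # toy response sequence
elapsed = torch.tensor([[12.0], [45.0], [8.0]])      # seconds to answer each item
lag = torch.tensor([[30.0], [600.0], [15.0]])        # gap before each item

decoder_input = response_embed(responses) + elapsed_proj(elapsed) + lag_proj(lag)
# decoder_input would then flow into the transformer alongside the exercise sequence
```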
And adding in more information still, ProcessBERT added timing and the use of resources like a calculator, finding that this further additional information led to better performance. Now, in discussing these, I've tried to be fairly calm about how big the benefits were, but there's actually a curious methodological note that I kind of have to report, which is that most DKT family papers report not tiny improvements but large improvements over previous algorithms, including other DKT family algorithms.
And those improvements somehow seem to mostly or entirely dissipate in the next paper. So what's going on here? Poor validation and overfitting, unfortunately. A lot of DKT family papers don't use student-level cross-validation, which, my gosh, why didn't they watch Big Data in Education?
Poor cross-validation benefits DKT family algorithms more than other algorithms, because DKT family models fit more aggressively. So, in other words, a lot of the gigantic improvements that we see in individual DKT papers are there because the authors are literally using data from a student's future to predict their past. Almost as troubling, a lot of DKT family papers fit their own hyperparameters for their new algorithm but use past hyperparameters for the other algorithms, which of course is going to massively bias the comparison in favor of the new one, because it has more flexibility of fit.
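Coming back to the cross-validation point: student-level cross-validation just means that every student's data lands entirely in the training fold or entirely in the test fold. Here is a minimal sketch using scikit-learn's GroupKFold, with invented feature and label arrays.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(1000, 5)                     # interaction-level features
y = np.random.randint(0, 2, size=1000)          # correctness labels
student_ids = np.random.randint(0, 50, 1000)    # which student produced each interaction

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=student_ids):
    # no student appears in both folds, so a student's future can't leak into training
    assert set(student_ids[train_idx]).isdisjoint(student_ids[test_idx])
    # fit the model on X[train_idx], y[train_idx]; evaluate on the held-out students
```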
Having said that, despite the shoddy evaluation in a lot of DKT family papers, there have been solid evaluations of them, and those have typically demonstrated that there are benefits to DKT. My current favorite, even though it's now three years old, is Gervet et al., who compared a bundle of KT algorithms on several datasets. And I love that not only were their methods solid, but that they looked across a whole lot of datasets. Some of their key findings: different datasets have different winners.
A surprising number of DKT family papers are on a single dataset, and Gervet et al. find that that's always going to lead to a somewhat idiosyncratic winner. They also found that DKT family algorithms perform better than other algorithms on large datasets, but worse on smaller datasets. And in particular, DKT family algorithms perform worse than the LKT family on datasets with very high numbers of practices per skill, such as language learning domains.
They do find that DKT family algorithms are better at predicting when the exact order of the items matters, which can occur if items within a skill vary a lot. And they find that DKT family algorithms reach peak performance faster than other algorithms, which corresponds to the Zhang et al. findings I mentioned a few minutes ago. In a second major evaluation, Schmucker et al. 2022 compared KT algorithms on several large datasets, tuning all the models' hyperparameters from scratch.
They found that their feature-based logistic regression model outperformed all the other approaches on nearly all the datasets tested. In their comparison, DKT was the best performing algorithm on one dataset, but, curiously enough, the later DKT family variants were outperformed by standard DKT on every dataset. Which is really kind of concerning, because this was the first paper to really retune all those models from scratch, with no cheating and no researcher degrees of freedom, and suddenly all this later work, which so much energy had gone into, didn't look that much better anymore. The DKT family work is moving very quickly. There are literally dozens of papers a year.
And I think that one of the key frontiers for this family of algorithms is getting beyond correctness. Ghosh et al. propose option tracing, which extends the output layer of the neural network to predict not only correctness, but which multiple-choice answer the student will select. And that has a lot of potential
for being able to capture misconceptions as well as correct knowledge. Open-ended knowledge tracing, Liu et al.'s work, goes even beyond that, integrating KT with a GPT-2 model that's fine-tuned on 2.1 million Java code exercises and written descriptions of them, in order to generate the specific code a student is predicted to write, including the specific errors they're predicted to make, helping to get at not just what a student knows but what they don't know and what misconceptions they have. So going forward, work on the DKT family of algorithms continues. There have been literally dozens of recent papers trying to get better results by adjusting and tweaking the deep learning framework in various ways. Better results in these papers almost universally means higher AUC values for predictions of next-item correctness on test data and selected datasets.
However, as Schmucker and colleagues showed in 2022, better results on some datasets do not always translate to better results on all datasets. So you might be asking yourself at this point: why would you want to use a DKT family algorithm anyway, given this somewhat unstable performance? Well, maybe you just care primarily about predicting next-problem correctness. Or you're willing to use a method like AOA to get skill estimates.
Maybe you have unreliable skill tags, or no skill tags at all, in which case you can't use BKT or PFA or LKT. Or maybe you want better estimation right from the beginning of a new skill, where we know that DKT family does better. In these cases, you probably also have a data set with a reasonably balanced number of attempts.
Or you just don't care as much about items or skills with fewer attempts. And, probably, your dataset has students working through material in predefined sequences, because if you don't have that, DKT loses a lot of its ability to make better predictions. On the other hand, you might not want to use a DKT family algorithm if you want interpretable parameters, or if you have a small dataset, and it's a pretty big definition of small: under a million interactions counts as small here.
And you may remember that Slater and Baker showed that you could get decent prediction even for the most stringent case for BKT with 250 students and 6 actions per student. That's what, a little over a thousand? 1,500?
Very different than 1 million. You also might not want to use a DKT family algorithm if you want to add new items without refitting the model, because without a skill mapping, how is the model going to make predictions for a brand-new item? And finally, you might not want to use a DKT family algorithm if you want an algorithm with more thoroughly understood and more consistent behavior.
So next up, and last up, I think, for our KT week: memory algorithms.