Bleeding Edge Alert! Sparse Linear Methods
It’s time for another bleeding edge alert! This is where we talk about some recent research that has promising results, but hasn’t yet made it into mainstream use in recommender systems.
In this bleeding edge alert, we’re actually going back to the year 2011 to look at sparse linear methods, or SLIM for short. This came out of the University of Minnesota, which has been a pioneer in the field of recommender systems from the very beginning. The results from SLIM were very exciting, and I don’t know why we don’t hear about it being used in industry more. It’s definitely something worth watching and keeping an ear out for.
What caught my attention with this paper is how consistently SLIM outperformed a wide variety of competing algorithms on a wide variety of data sets. They didn’t include SVD++ in their comparisons, but they did compare against many similar algorithms. Also, they measured their success with hit rate, so their focus is very much on top-N recommendations instead of rating prediction accuracy, which again is the right thing to focus on.
Not only did SLIM beat the pants off of everything else on the Netflix data set, it also won on data sets from a book store, from Yahoo Music, from credit card purchasing data, and from a couple of mail-order retailers. The only data set where it didn’t come out on top was, ironically, MovieLens – but it placed a close second.
The idea behind SLIM is to generate a recommendation score for a given user and item by taking a sparse aggregation of the other things that user has rated, multiplied by a sparse aggregation of weights, which is where the magic lives. In the paper’s notation, tilde-a sub ij represents the unknown score for user i on item j, and that’s equal to the entire row of stuff that user has rated – a sub i transpose – multiplied by a sparse column of precomputed weights for that item. Because the user has only rated some of the items, and weights only exist for some of the items, that’s where the word sparse comes from in this method.
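Written out as an equation (this is my transcription of the paper’s notation):

    \tilde{a}_{ij} = a_i^T w_j

where a_i^T is the i-th row of the rating matrix A, meaning everything user i has rated, and w_j is the j-th column of the weight matrix W, holding the learned weights for item j.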
So if you extend this idea to the entire user/item rating matrix, it looks like this:
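    \tilde{A} = A W

Here A is the whole user/item rating matrix, W is an item-by-item weight matrix, and A-tilde holds the predicted scores.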
Pretty simple, right? To predict ratings, we just multiply the ratings we know about by some magical weight matrix called W. Both A and W are sparse matrices, meaning most of their entries are empty; A contains just the ratings that actually exist.
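Here’s a minimal sketch of that prediction step in Python, just to make it concrete. The toy ratings and weights below are made up, and scipy’s sparse matrices stand in for the real data structures; this isn’t the paper’s code.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy rating matrix A: rows are users, columns are items.
# Zeros mean "not rated"; only the known ratings are stored.
A = csr_matrix(np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
]))

# Pretend W is an already-learned sparse item-item weight matrix
# (in real SLIM it comes out of the optimization described below,
# with a zero diagonal so an item can't recommend itself).
W = csr_matrix(np.array([
    [0.0, 0.2, 0.7, 0.0],
    [0.3, 0.0, 0.0, 0.1],
    [0.6, 0.0, 0.0, 0.4],
    [0.0, 0.1, 0.5, 0.0],
]))

# The entire prediction step is one sparse matrix multiply.
A_tilde = (A @ W).toarray()

# For top-N recommendations, rank the items each user hasn't rated yet.
A_tilde[A.toarray() > 0] = -np.inf        # mask out items already rated
top_n = np.argsort(-A_tilde, axis=1)[:, :2]
print(top_n)                              # the two best new items per user
```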
So the secret sauce is in how W is computed, and this is where it gets complicated. So complicated that a lot of people have trouble implementing it, which might explain why it’s not more widely used. Just the first couple of paragraphs of the paper’s description of how to do it make that clear. At its heart, it’s an optimization problem, just like when we used stochastic gradient descent to learn the matrices in SVD. But you seriously need an advanced degree in mathematics to follow what’s going on here. If phrases like “matrix Frobenius norm” and “l1-norm regularization” make sense to you, then knock yourself out. Otherwise, I’ll be snooping around on GitHub for people who have already implemented this successfully and shared their code. I did find one, but the guy who wrote it had trouble getting it to work at scale. It’s not easy stuff at all.
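If you do want to peek under the hood: the paper learns each column of W by minimizing the squared error between A w_j and the actual ratings for item j, plus l2 (Frobenius-norm) and l1-norm regularization penalties, with the constraints that the weights are non-negative and that the diagonal entry is zero so an item can’t just predict itself. A shortcut you’ll see in some open-source implementations is to hand each column to scikit-learn’s ElasticNet, which handles the l1/l2 penalties and the non-negativity for you. Here’s a rough sketch of that approach; it’s my reading of how people commonly do it, not the authors’ own code, and the alpha and l1_ratio values are placeholders you’d need to tune.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import ElasticNet

def learn_slim_weights(A, alpha=0.1, l1_ratio=0.5):
    """Learn a sparse item-item weight matrix W, one column at a time.

    A rough ElasticNet-based approximation of SLIM's training step,
    not the paper's own coordinate-descent code.
    """
    A = csr_matrix(A)
    n_items = A.shape[1]
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                       positive=True,        # SLIM's non-negativity constraint
                       fit_intercept=False,
                       max_iter=200)
    columns = []
    for j in range(n_items):
        y = A[:, j].toarray().ravel()   # ratings for item j are the target
        X = A.tolil()                   # copy we can cheaply edit
        X[:, j] = 0                     # zero diagonal: item j can't predict itself
        model.fit(X.tocsr(), y)
        columns.append(csr_matrix(model.coef_).T)
    return hstack(columns).tocsr()

# With the toy matrix A from the earlier snippet:
#   W = learn_slim_weights(A)
#   A_tilde = A @ W
```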
Still, you can’t argue with SLIM’s results, so if you see it offered as part of a machine learning library you’re working with, it’s definitely worth your time to experiment with it. The research community has taken note, and some extensions of SLIM have emerged, such as contextual SLIM and higher-order SLIM, or HOSLIM. Even the original paper describes an extension called fsSLIM, or feature-selection SLIM, which restricts SLIM to using only the columns most similar to the item you’re scoring – this improves hit rate ever so slightly.
So if a SLIM implementation comes your way, give it a second look. It’s promising stuff.