Let’s pause for another hands-on exercise. Revisit our EvaluateUserCF script, and this time modify the recommender so that users whose rating count is more than 3 standard deviations from the mean are excluded from consideration. This will eliminate so-called “super-users” who have an outsized impact on your results. Let’s measure the effect of filtering them out.
To do this, you’ll probably want to focus your attention on the MovieLens module again – the best place to filter out these outliers is in the function that actually loads the MovieLens data set itself. Doing this easily will require some familiarity with the Pandas module, so if you’re new to Pandas, you might want to just skip to my solution and learn from it. But if you are starting to feel comfortable with Pandas, give it a shot yourself – and compare your results to mine, up next.
So to see how I went about filtering outliers in the MovieLens data set, open up the MovieLens3.py file in the Challenges folder of the course materials.
The changes are in the loadMovieLensLatestSmall function. You can see I’ve re-implemented it to use Pandas to load the raw ratings data, and then to filter out those outliers. The dataset for our recommender framework is then built up from that Pandas dataframe, instead of directly from the CSV ratings file.
We start by loading the ratings data file into a Pandas dataframe called ratings. We print out the start of it, and its shape, so we can see what the data we’re starting with looks like, and how much of it there is.
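A minimal sketch of that first step might look something like this – the file path is an assumption on my part, so point it at wherever your copy of ratings.csv lives:

```python
import pandas as pd

# Load the raw ratings file; the column layout (userId, movieId, rating, timestamp)
# follows the MovieLens "latest-small" data set.
ratings = pd.read_csv('ml-latest-small/ratings.csv')
print(ratings.head())
print(ratings.shape)
```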
Next we need to identify our outlier users, which first means counting up how many ratings each user has. We use the groupby command on line 34, together with the aggregate command, to build up a new dataframe called ratingsByUser that maps user IDs to their rating counts – and we print it out so we can see what we’re working with as we develop and debug all of this.
Now we’re going to add a column to that ratingsByUser dataframe that indicates whether or not each user is considered an outlier. Pandas is pretty neat in that we can define a new column just by assigning it the result of operations on the existing dataframe. Here, we’re taking the absolute value of the difference between each user’s rating count and the mean rating count across all users, and comparing that to the standard deviation of the rating counts multiplied by the value we choose, which is 3 by default. So, this line adds a new “outlier” column that is true if the rating count for this user is more than three standard deviations from the mean, and false otherwise. We don’t need the actual rating column any longer, so we drop that, and print out what we have at this stage.
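Those two steps, sketched in code – variable names here are my own approximation of what the script does, continuing from the ratings dataframe loaded above:

```python
outlierStdDev = 3.0  # how many standard deviations from the mean counts as an outlier

# Count how many ratings each user has made.
ratingsByUser = ratings.groupby('userId', as_index=False).agg({'rating': 'count'})

# Flag users whose rating count is more than outlierStdDev standard deviations from the mean.
ratingsByUser['outlier'] = (abs(ratingsByUser['rating'] - ratingsByUser['rating'].mean())
                            > ratingsByUser['rating'].std() * outlierStdDev)

# We no longer need the count itself; keep just userId and the outlier flag.
ratingsByUser = ratingsByUser.drop(columns=['rating'])
print(ratingsByUser.head())
```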
Next, we merge this dataframe that indicates who is and isn’t an outlier with our original ratings dataframe, based on the user IDs. This gives us a handy outlier column on the raw individual ratings data. Again we print out what we have so far on line 45.
Next, we get rid of all the outlier data on line 47 by creating a new dataframe called “filtered” that is just the merged dataframe, restricted to rows where the outlier column is false. We can then drop the outlier and timestamp columns from the resulting dataframe, as we don’t need that data anymore.
At this point we have what we want – a dataframe of individual user IDs, movie IDs, and ratings that excludes ratings from users we consider outliers. On lines 49-51 we print out the resulting dataframe and its shape, so we can get a sense of how many ratings got dropped as a result of all this.
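Again as a rough sketch, the merge-and-filter stage looks something like this, building on the dataframes from the previous snippets:

```python
# Attach the outlier flag to every individual rating.
merged = ratings.merge(ratingsByUser, on='userId', how='left')
print(merged.head())

# Keep only ratings from non-outlier users, then drop the columns we no longer need.
filtered = merged[merged['outlier'] == False]
filtered = filtered.drop(columns=['outlier', 'timestamp'])
print(filtered.head())
print(filtered.shape)
```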
From there, we’re fortunate that surpriselib has a convenience function for creating a dataset from a Pandas dataframe, which is what we’re doing on line 54.
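That convenience function is Dataset.load_from_df, which expects the columns in user, item, rating order along with a Reader. A sketch of that last step – the 0.5 to 5.0 rating scale matches the MovieLens latest-small data, but double-check it against your own copy:

```python
from surprise import Dataset, Reader

# Build a surpriselib dataset directly from the filtered Pandas dataframe.
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(filtered[['userId', 'movieId', 'rating']], reader)
```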
So, let’s open up EvaluateUserCF-Outliers.py, and kick it off to see what happens.
You can see all of our debugging output that illustrates what happens at each stage within our loadMovieLensLatestSmall function, as we build up the outlier data and exclude ratings from those outliers. An important takeaway is that we started off with 100,004 ratings and ended up with 80,398. So even with the fairly conservative definition of an outlier as being beyond 3 standard deviations, we ended up losing quite a few ratings.
The resulting hit rate was 4.4%, and we saw hit rates of over 5% from this same algorithm earlier. So it would seem that in this case, eliminating outliers did more harm than good. That’s specific to this data set, however – the MovieLens data set we’re working with has already been filtered and cleaned, and we know all of this data represents real, individual people. So by filtering out outliers, we’re just throwing away valid information. But in a real-world situation, things will probably be quite different, and you’ll find that removing outlier users or even outlier items can measurably improve your results.
Malicious User Behavior
Another real-world problem is people trying to game your system. If the items your recommender system promotes to your users end up being purchased more, the makers of those items have a financial incentive to find ways to game your system into recommending their items more often. Or, people with certain ideological agendas might purposely try to make your system recommend items that promote their ideology, or suppress items that run counter to it. A bored hacker might even try to create humorous pairings in your recommender system just for their own amusement. “Google bombs” are an example of this.
Fighting people like this is generally a never-ending arms race, but there is one technique that works remarkably well: make sure that recommendations are only generated from people who actually spent money on the item. Recommendations based on implicit ratings from purchasing data are almost impervious to these sorts of attacks, because it would be prohibitively expensive for someone to buy enough of an item to artificially inflate its presence in your recommendations. And when people “vote with their wallets,” it’s a very strong and reliable indication of interest that leads to better recommendations overall.
Sometimes, however, you won’t have enough purchase data to work with. There are still precautions you can take. For example, if your recommender system is based on star reviews, you can make sure that you only allow reviews from people you know actually purchased or consumed the content in question. If you allow people to rate items they haven’t actually seen or used, you’re opening yourself up to all sorts of attacks. And using click data should always be a last resort – not only is it very easy to fake click data using bots, but even genuine click data has its own set of problems.
The Trouble with Click Data
Using implicit clickstream data, such as images people click on, is fraught with problems. You should always be extremely skeptical about building a recommender system that relies only on things people click on, such as ads. Not only are these sorts of systems highly susceptible to gaming, they’re susceptible to quirks of human behavior that aren’t useful for recommendations. I’ve learned this the hard way a couple of times.
If you ever build a system that recommends products based on the product images people click on when they see them in an online ad, I promise you that what you build will end up as a pornography detection system. The reality is, people instinctively click on images that appear sexual in nature. Your recommender system will end up converging on products whose pictures include a lot of flesh, and there won’t be anything you can do about it. Even if you explicitly filter out items that are sexual in nature, you’ll end up discovering products that just vaguely look like sex toys or various pieces of sexual anatomy.
I’m not making this up – I’m probably not at liberty to talk about the details, but I’ve seen this happen more than once. Never, ever build a recommender system based on image clicks! And implicit data in general tends to be very low quality, unless it’s backed by a purchase or actual consumption. Clickstream data is a very unreliable signal of interest. What people click on and what they buy can be very different things.
International Considerations
We’re not done yet! There are just a lot of little “gotchas” when it comes to using recommender systems in the real world, and I don’t want you to have to learn them the hard way.
Another consideration is dealing with international markets. If your recommender system spans customers in different countries, there may be specific challenges you need to consider.
For example, do you pool international customer data together when training a recommender system, or keep everything separated by country? In most cases, you’ll want to keep things separate, since you don’t want to recommend items in a foreign language to people who don’t speak that language, and there may be cultural differences that influence people’s tastes in different countries as well.
There is also the problem of availability and content restrictions. Movies in particular are often released on different schedules in different countries, and may have different licensing agreements depending on which country you’re in. You may need to filter certain movies out based on what country the user is in before presenting them as recommendations. Some countries have legal restrictions on what sort of content can be consumed as well, which must be taken into consideration. You can’t promote content about Nazi Germany within Germany, for example, nor can you promote a long list of political topics within China.
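There’s no single right way to implement those availability rules, but conceptually they’re just a filter applied to your candidate list before ranking. Here’s a deliberately simple, hypothetical sketch – the available_countries lookup and the item names are entirely made up:

```python
def filter_by_country(candidates, available_countries, user_country):
    """Keep only candidate items that may be shown in the user's country."""
    return [item for item in candidates
            if user_country in available_countries.get(item, set())]

# Example usage with made-up data:
available_countries = {'movie_1': {'US', 'CA'}, 'movie_2': {'US', 'DE'}}
print(filter_by_country(['movie_1', 'movie_2'], available_countries, 'DE'))  # ['movie_2']
```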
Since your recommender system depends on collecting data on individual interests, there are also privacy laws to take into consideration, and these too vary by country. I am not a lawyer, and these laws are changing all the time, but you’ll want to consult with your company’s legal and IT security departments to ensure that any personal information you collect in the course of building your recommender system is handled in accordance with the laws of each country you operate in.
The Effects of Time
A topic that is really under-represented in recommender system research is dealing with the effects of time.
One example is seasonality. Some items, like Christmas decorations, only make for good recommendations just before Christmas. Recommending bikinis in the dead of winter is also a bad idea. Picking up on annual patterns like this is hard to do, and most recommender systems won’t do it automatically. As far as I know, this is still an open problem in the research arena. Those of you looking for a master’s thesis topic – there you go.
But something you can do more easily and more generally is to take the recency of a rating into account. Netflix in particular found that these sorts of temporal dynamics are important; your tastes change quickly, and a rating you made yesterday is a much stronger indication of your interest than a rating you made a year ago. By weighting ratings by their age, using some sort of exponential decay, you can improve the quality of your recommendations in many cases – or you can use rating recency as a training feature in its own right, in addition to the rating itself.
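One simple way to compute that kind of decay – just a sketch, with an arbitrary half-life you’d want to tune – is to derive a weight from each rating’s timestamp and feed it into whatever your training process does with sample weights:

```python
import time

def decayed_weight(rating_timestamp, half_life_days=180.0, now=None):
    """Weight a rating by its age using exponential decay: a rating made today
    gets weight 1.0, one made half_life_days ago gets 0.5, and so on."""
    now = time.time() if now is None else now
    age_days = (now - rating_timestamp) / 86400.0  # seconds -> days
    return 0.5 ** (age_days / half_life_days)

# Hypothetical usage on a ratings dataframe that still has its Unix timestamps:
# ratings['weight'] = ratings['timestamp'].apply(decayed_weight)
```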
As we mentioned in the context of Amazon, the product you’re looking at right now is the most powerful indicator of your current interest. Time plays a big part there, as well.
Whenever you train a recommender system using historical ratings data, you are giving your system a bias toward the past. A Netflix that didn’t take time into account would end up recommending a lot of old shows to you instead of the hot, newer ones you want to see. If the things you’re recommending are time-sensitive in any way, it makes sense to build that recency into your model, whether through weighting or as a feature of its own.
Optimizing for Profit
I could keep on going with real-world lessons, but I’ll end with this one.
In the real world, a recommender system you’re building for a company will ultimately exist for the purpose of driving their profit. You may find yourself being asked to optimize your recommender system for profit, instead of pure relevance.
What does that mean? Well, normally you’d test different recommender systems based on what drives the most purchases, video views, or some other concrete measure of whether a customer liked your recommendation enough to act on it. If you’re trying to build a recommender system that introduces people to new stuff they want, then that’s the right thing to do. You’d optimize for how many items your recommender system drove people to consume, not how much money those items made for the company.
But in the real world, some items are more profitable than others. Some items might even be offered at below cost as loss leaders to attract people to your site, and recommending them actually costs your company money.
This presents a bit of a moral quandary for developers of recommender systems. Do we keep our algorithms “pure” and optimize only for users’ interests, or do we work profit into the equation by rewarding algorithms that generate more profit? The euphemism for this is “value-aware recommendations,” where the contribution to the company’s bottom line plays a role in whether something is recommended.
If you’re ever faced with this dilemma, a reasonable compromise is to only use profitability as a tie-breaker. If you’re showing top-10 recommendations to someone, and the underlying scores for the top 2 items in that list are about the same, it wouldn’t do any harm to show the more profitable item first. Items higher in a list are more likely to be clicked on (we call this “position bias”), and if you’re really not sure which item should come first, you may as well go with the one that will earn more money for your company.
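To make that concrete, here’s a hedged little sketch of a tie-break re-ranker. The tolerance value, the profit lookup, and the item names are all made up for illustration:

```python
def rank_with_profit_tiebreak(candidates, relevance, profit, tolerance=0.01):
    """Order candidates by predicted relevance; profit only breaks ties between
    items whose relevance scores fall within the same 'tolerance'-wide bucket."""
    def key(item):
        bucket = round(relevance[item] / tolerance)  # quantize near-equal scores together
        return (bucket, profit.get(item, 0.0))
    return sorted(candidates, key=key, reverse=True)

# Example with made-up scores: the top two items are essentially tied on relevance,
# so the more profitable one is shown first.
relevance = {'A': 0.903, 'B': 0.901, 'C': 0.75}
profit = {'A': 1.50, 'B': 4.00, 'C': 2.00}
print(rank_with_profit_tiebreak(['A', 'B', 'C'], relevance, profit))  # ['B', 'A', 'C']
```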
Optimizing too much for profit can backfire, though. You don’t want to end up recommending only expensive items to your customers, because they’ll be less likely to purchase them. It might make more sense to look at profit margin rather than absolute profit, so that the price of the item isn’t really a factor in what you recommend.