Colin Green
Stuff and Nonsense (but mostly nonsense)


Is the Netflix Prize winnable?

Mood:
Contemplative

Over the last 3 months I've been working on (or competing in) the Netflix Prize (www.netflixprize.com). Actually I came to the party late, as the competition started in October '06 and Simon Funk (an alias) posted a blog entry describing his SVD (Singular Value Decomposition) technique in December '06. Simon's entry was something of a bombshell, as most competitors, if not all, are now using some variation on it, at least in part.

I'm fairly proud of my own SVD algorithm, written in C#/.NET 2.0, as it crams the Netflix data consisting of 100 million ratings into just 275 MB. This packing of the data actually improves performance, because each rating is used on each iteration (epoch) of the SVD algorithm, so it is effectively a memory-bandwidth-bound algorithm. As evidence of this, I find that running two SVD algorithms together on a dual core machine results in both running at half speed (or thereabouts), basically because each core spends most of its time twiddling its thumbs waiting for data to arrive on the memory bus.
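To give an idea of the sort of packing I mean, here's a rough sketch rather than my exact layout (the field widths are assumptions, and at 4 bytes per rating it would come to ~400 MB rather than 275 MB, so the real layout has to be tighter still). The general idea is to group the ratings by movie and pack a dense user index and the rating into a single word:

    // Rough sketch only: ratings grouped by movie, each packed into a 32-bit word.
    // Assumed field widths: a dense user index (0..~480k) fits in 19 bits,
    // a 1-5 rating fits in 3 bits.
    struct PackedRatings
    {
        public int[] MovieStart;  // offset into Entries where each movie's ratings begin
        public uint[] Entries;    // bits 0-18: dense user index, bits 19-21: rating

        public static uint Pack(int userIdx, int rating)
        {
            return (uint)userIdx | ((uint)rating << 19);
        }

        public static void Unpack(uint entry, out int userIdx, out int rating)
        {
            userIdx = (int)(entry & 0x7FFFF); // low 19 bits
            rating  = (int)(entry >> 19);     // next 3 bits
        }
    }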

This, along with some knowledge of optimisation and cutting-edge PC hardware (Intel Core 2 Duo), means I can perform one SVD iteration in under 2 seconds. Pretty darn fast.

Even so, my current setup requires at least 120k epochs to get a good result, and in fact some recent results suggest that more like 200-300k+ epochs are needed to get the best possible result.

So while waiting for one of these runs to complete I got thinking about whether this competition is actually winnable. It's a subject that has come up a few times in the Netflix Prize forums.

Basically the goal is to get the root mean squared error (RMSE) of your predictions down to <=0.8563. Netflix's own Cinematch algorithm scores 0.9514 on the competition data, my own SVD algorithm scores about 0.9060 right now, and the top entry on the leaderboard at the time of writing is 0.8808, which is pretty good. As a reference point, just using the average rating of each movie for all predictions gives an RMSE of about 1.06, and some basic refinements on that approach (as outlined by Simon Funk) can get you down to 0.9840.
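For anyone unfamiliar with the metric, RMSE is just the square root of the mean squared difference between predicted and actual ratings; in C# it amounts to something like:

    // RMSE: square root of the mean of the squared prediction errors.
    static double Rmse(double[] predicted, byte[] actual)
    {
        double sumSq = 0.0;
        for (int i = 0; i < predicted.Length; i++)
        {
            double err = predicted[i] - actual[i];
            sumSq += err * err;
        }
        return Math.Sqrt(sumSq / predicted.Length);
    }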

So the whole competition is basically compressed into that tiny range between 0.9840 (the easy-to-obtain baseline RMSE, if you like) and 0.8563 (the million-dollar prize!). That's a range of 0.1277. In those terms the prize seems teasingly close, but of course the RMSE score is subject to the law of diminishing returns - expend twice as much effort to achieve half as much gain. So as the leading score inches forward it's easy to come to the conclusion that the leaders are simply approaching some lower bound by smaller and smaller increments, and although near to 0.8563, will never actually attain it.

OK, but a couple of times over the (relatively short) life of the competition there have been significant leaps in best RMSE as new entrants have come in with new algorithms. So who is to say the lower bound isn't actually some way below where we are right now, and below 0.8563? Are there any ways of getting an estimate for where this lower bound is? Maybe.

SVD old timers will be well aware of the problem of overfitting: getting a much better RMSE on the training data set than on the probe set. Simon Funk introduced a 'regularisation' aspect to his algorithm to help alleviate this problem, but it is still very much present. Basically, Simon decays the SVD feature values by some small amount (generally referred to as K, a proportion) in the feature value update expression. Without this decay the probe RMSE actually starts to rise after a few features as the algorithm overfits to the training data. With a relatively small K of 0.015, each feature seems to train for more epochs before completing (i.e. before the probe RMSE stops falling) and also gives a better probe RMSE at the end of each feature. The training set RMSE is also actually higher, as it is not overfitting quite so much.
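In code, the per-rating update at the heart of this looks something like the following (essentially the update rule Simon described; the 0.001 learning rate is his suggested value, and K is the decay term just discussed):

    const double lrate = 0.001;  // learning rate (Simon's suggested value)
    const double k = 0.015;      // regularisation / decay term (K)

    // One feature-value update for a single (user, movie, rating) triple.
    // 'predicted' is the model's current prediction for this user/movie pair.
    static void UpdateFeature(double rating, double predicted,
                              ref double userValue, ref double movieValue)
    {
        double err = rating - predicted;
        double uv = userValue;

        // The '- k * value' terms decay the feature values, fighting overfitting.
        userValue  += lrate * (err * movieValue - k * userValue);
        movieValue += lrate * (err * uv - k * movieValue);
    }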

Discussions on the Netflix Prize forums mostly seem to refer to K=0.015 as a good, near-optimal value, which I had taken at face value. Some recent experiments, though, suggest that a higher K gives better results, albeit at the expense of longer clock time to achieve them (more epochs required per feature). K=0.025 seems like a better K to me right now; 0.05 at first sight seems to be one step too far, as the resulting probe RMSE per feature is now higher than that obtained for K=0.025.

But is 0.05 worse over the long term? My hunch is that higher values of K will result in runs taking longer and longer but ultimately reaching lower and lower scores. I already have some fairly minimal evidence to support this, so let's run with it for now. Let's say that we can just keep on increasing K; eventually the plots of probe and training RMSE might actually be very close to each other, i.e. the two plots will have (almost) converged. If we experimentally find where this convergence point is, we might have found the lower bound for the competition RMSE, at least for SVD, and just maybe that lower bound will be <=0.8563. Maybe.


Hmm, maybe not. Consider this: the baseline predictions (as set out by Simon) give a probe RMSE of 0.9840. We know that K=1 will prevent any learning, and thus the probe and training RMSE converge at 0.9840. Reducing K now results in the probe and training RMSE falling but separating. Thus our convergence point isn't between the probe and training RMSE we observe in 'normal' runs; it is actually above those values, at the baseline RMSE!

OK, so let's take another line. SVD can achieve an RMSE of <0.8 on the training set with just a dozen features. We have approximated 100 million ratings with an RMSE of just 0.8; that's pretty good for 100m ratings. Presumably if we increased the number of ratings we train against, that figure would go up, as overfitting would be reduced. If we doubled, then quadrupled, the # of ratings, that RMSE would rise by less and less as it converged on (but never reached) the magic lower bound, this time from below.

We don't have more ratings (Netflix do, BTW), but we should be able to get a good idea of the relationship between # of ratings and achievable RMSE by running SVD with fewer ratings and extrapolating. I'd expect the RMSE achieved at each subsequent increase in # of ratings to follow an exponential-style curve: doubling the # of ratings each time results in an increase of RMSE that halves each time, something like that.

So for number of ratings we could use...

100m
50m
25m
12.5m
6.25m
3.125m
1.5625m
0.78125m

I expect that the smaller sample sizes will be noisier, so going much below 1m probably isn't worth it; thus we are left with just 8 points on our curve, which should give some idea of where the convergence point is. In fact, for the smaller sets we can sample multiple sets of that size from within the total 100m to help squash the noise. So maybe we could get a couple more points, or a less noisy curve for the above 8 points.

I haven't done this yet, but it would be fairly easy to do. The one area I'm fuzzy on is ensuring the sub-sets are statistically similar, but I don't think it matters that much, so long as they aren't wildly dissimilar.
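One simple way to build a sub-set, sketched below, is to take a random fraction of each movie's ratings, which should keep the per-movie distribution roughly in line with the full set; repeating with different seeds gives the multiple same-size samples mentioned above. This is just a sketch of the approach, not code I've run:

    using System;
    using System.Collections.Generic;

    // Sketch: build a sub-set containing roughly 'fraction' of the ratings by
    // sampling a random fraction of each movie's ratings, so the per-movie
    // distribution stays roughly similar to the full 100m set. Different seeds
    // give different samples of the same size, useful for squashing noise.
    static List<int>[] SampleRatings(List<int>[] ratingIdxByMovie, double fraction, int seed)
    {
        Random rng = new Random(seed);
        List<int>[] sample = new List<int>[ratingIdxByMovie.Length];

        for (int m = 0; m < ratingIdxByMovie.Length; m++)
        {
            sample[m] = new List<int>();
            foreach (int idx in ratingIdxByMovie[m])
            {
                if (rng.NextDouble() < fraction)
                    sample[m].Add(idx);
            }
        }
        return sample;
    }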

Whether it's best to use base-2 or not I'm not sure. We could take a base-10 logarithmic scale, but then we're left with sample points at 100m, 10m, 1m, 100k, which is not really enough points to extrapolate a trend from, especially if there is significant noise in there. Base-2 seems like a good place to start.
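To make the extrapolation concrete, here's a sketch of the arithmetic, assuming the increments really do shrink geometrically with each doubling. The RMSE figures in it are made up purely to show the calculation; the real ones would come from the runs above:

    using System;

    static class LowerBoundEstimate
    {
        static void Main()
        {
            // Hypothetical training RMSEs measured at each doubling of the rating
            // count (0.78125m, 1.5625m, ..., 100m). These are made-up numbers.
            double[] rmse = { 0.700, 0.740, 0.762, 0.774, 0.781, 0.785, 0.788, 0.790 };

            int n = rmse.Length;
            double lastInc = rmse[n - 1] - rmse[n - 2];
            double prevInc = rmse[n - 2] - rmse[n - 3];
            double r = lastInc / prevInc;  // ratio by which the increments shrink

            // Remaining rise beyond the last point is a geometric series:
            // lastInc*r + lastInc*r^2 + ... = lastInc * r / (1 - r).
            double limit = rmse[n - 1] + lastInc * r / (1.0 - r);

            Console.WriteLine("Estimated convergence point: {0:F4}", limit);
        }
    }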


