Thinking as a Hobby

I hadn't even heard of this. Basically, Netflix wants to improve their movie recommendation system, so they've put out a $1 million prize to anyone who can improve the prediction error of their current system by 10%. Yearly prizes of $50,000 go to the current leader if the 10% mark hasn't been reached. They just awarded the $50,000 to a team which got some help from AT&T, who say they'll donate it to charity.
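To make "improve the prediction error by 10%" concrete, here is a minimal sketch. As far as I know, the contest scored entries by root-mean-square error (RMSE) between predicted and actual ratings. The ratings and both sets of predictions below are made up for illustration, and the helper names are my own.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def improvement_pct(baseline_error, new_error):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Made-up ratings for illustration only; not taken from the Netflix data.
actual = [3, 5, 1, 4]
baseline_predictions = [4, 4, 2, 3]          # hypothetical "current system"
new_predictions = [3.5, 4.5, 1.5, 3.5]       # hypothetical challenger

baseline_error = rmse(baseline_predictions, actual)
new_error = rmse(new_predictions, actual)
gain = improvement_pct(baseline_error, new_error)
```

On this toy data the challenger halves the error, which is far more than any real entry could hope for. The point is only that the 10% is measured relative to the baseline's error, not in rating points.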

I downloaded their dataset, which is impressive. It includes over 100 million ratings, spanning a 7-year period, from 480,000 customers. That's the training set. So you train up your system, and then evaluate it on the test set (which includes about 2.8 million customer/movie pairs). The task is to predict what ratings the customers will give the movies.

You're given a movie data file, which includes the ID, year, and title of each movie. My first impression was that the movie data would need to be enriched. That is, knowing that someone gave Hang 'Em High a rating of 3 and The Toxic Avenger a 5 gives you some information, but not a whole lot. The only feature you have to correlate with the reasons for their choices is the year each movie was made. However, if you beefed up the movie data with genre, lead actors, director, running time, and similar information relevant to judging quality, then you'd have a lot more to go on in working out which features drove a customer's rating.

But there are over 17,000 movies, and augmenting the movie data by hand would be a real pain in the ass. You might be saying, "Whoa, they're giving you an embarrassment of riches in terms of data...why the hell would you want more?" And that's a fair point: the features relevant to making certain ratings are probably going to be implicit in the data set. That is, if a subset of people tended to give every movie starring Clint Eastwood a higher rating than the bulk of users, because they happen to be big Eastwood fans, your system should be able to key into that statistical regularity, even though it is not getting any explicit information about who stars in each movie. But if that information were made explicit, the regularity could be detected more easily and more reliably.
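Here is a rough sketch of how a regularity like the Eastwood one could show up from ratings alone, with no cast or genre data. It subtracts each user's average rating and then compares movies by cosine similarity over the users who rated both. The users, movie names, and ratings are all hypothetical, and this is just one simple technique (item-to-item similarity), not anything the contest prescribes.

```python
import math

# Hypothetical toy ratings: {user: {movie: stars}}. Imagine the two
# "western" titles share a lead actor, though the code never sees that.
ratings = {
    "fan_1":   {"western_a": 5, "western_b": 5, "comedy_a": 2, "comedy_b": 1},
    "fan_2":   {"western_a": 4, "western_b": 5, "comedy_a": 2, "comedy_b": 2},
    "other_1": {"western_a": 2, "western_b": 1, "comedy_a": 4, "comedy_b": 5},
    "other_2": {"western_a": 3, "western_b": 2, "comedy_a": 5, "comedy_b": 4},
}

def centered(ratings):
    """Subtract each user's mean so 'likes more than usual' is positive."""
    out = {}
    for user, by_movie in ratings.items():
        mean = sum(by_movie.values()) / len(by_movie)
        out[user] = {movie: stars - mean for movie, stars in by_movie.items()}
    return out

def movie_similarity(centered_ratings, movie_x, movie_y):
    """Cosine similarity over users who rated both movies."""
    dot = norm_x = norm_y = 0.0
    for by_movie in centered_ratings.values():
        if movie_x in by_movie and movie_y in by_movie:
            x, y = by_movie[movie_x], by_movie[movie_y]
            dot += x * y
            norm_x += x * x
            norm_y += y * y
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / math.sqrt(norm_x * norm_y)

deviations = centered(ratings)
same_group = movie_similarity(deviations, "western_a", "western_b")
cross_group = movie_similarity(deviations, "western_a", "comedy_a")
```

In this toy data the two westerns come out positively related and the western/comedy pair negatively related, purely from who rated what. With explicit cast data you could check the grouping directly instead of inferring it.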

I've seen other people talking about the prize, complaining about the noise in the data set (e.g., one customer apparently rated nearly 2,000 movies and gave them all a 1). Actually, that's a case which makes prediction easy (I think you'd predict that this person would give a 1 to the next movie they rated). There will be noise, of course, since people may not be very consistent in their rating choices. But my guess is that there is actually less noise in a data set like this than in many other kinds of data.
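The always-rates-1 customer is handled by about the simplest predictor imaginable: guess each user's own average rating. A minimal sketch follows. The user names and ratings are invented, and the fallback of 3.0 for users never seen in training is my own assumption.

```python
def user_mean_predictor(training_ratings, global_default=3.0):
    """Build a predictor that returns each user's own average rating.

    training_ratings is a list of (user, movie, stars) tuples.
    global_default is an assumed fallback for users with no history.
    """
    totals = {}
    for user, _movie, stars in training_ratings:
        count, total = totals.get(user, (0, 0))
        totals[user] = (count + 1, total + stars)
    means = {user: total / count for user, (count, total) in totals.items()}

    def predict(user, movie):
        # The movie is ignored: this baseline only knows about the user.
        return means.get(user, global_default)

    return predict

# Invented data: one user who gives everything a 1, one with mixed tastes.
training = [("always_one", movie_id, 1) for movie_id in range(2000)]
training += [("mixed", 1, 3), ("mixed", 2, 5)]

predict = user_mean_predictor(training)
constant_guess = predict("always_one", 99999)
mixed_guess = predict("mixed", 7)
stranger_guess = predict("stranger", 7)
```

For the constant rater this guesses 1 every time, which is exactly the "easy" prediction. The mixed user is where a baseline like this starts to fall short and the movie itself has to come into play.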

Anyway, if there are any enterprising programmers/machine learning/statistical analysis types out there, you could win a cool million if you know what you're doing. Since I'm trying to model cortical development and function, especially in terms of inference and prediction, this might be an interesting task, though I'm not sure I want to sink the time into it. You never know, though...




Posted 2007-11-15 9:21 AM as "The Netflix Prize".