Methodology

Recommend Steam users the games they really want

The problem


Adam has just spent his last 14 hours on Half-Life 2. In fact, he has bought all 9 titles from the series and has spent 2000 hours of his entire life on it. Although he really enjoys interacting with other Half-Life users on the Steam website where he has purchased his game from, he can't help but wonder whether he has missed out from games other than Half Life.

Naturally, he went back to Steam's website for clues to his next favourite game. He took a look at Steam's page for Half Life 2 and clicked on "More Like This Game". The page returned 12 games, of which 6 of them belongs to the Half-Life series. "I already own them", murmured Adam. Of the remaining 6, he is only interested in Counter-Strike and the Portal series. "This list certainly doesn't look interesting." To his frustration, there isn't a next page button anywhere to load more recommendations. He thinks to himself, "Ah well... shame that I have loads of cash to spend".

This would have been a different story for Adam, if he has come to this website and login with his Steam Account. he wouldn't be recommended with games he already owns, and I bet he would probably also like the recommendations from the improved algorithm , such as Team Fortress, Richochet, Day of Defeat series, Deathmatch Classic and Left 4 Dead too, which were not available on Steam's list of recommendations .

At the time this article is written, there are 75 million active Steam users and have at most 8 million concurrent users at a give time . Adam's experience is certainly not an isolated case; every Steam user I came across seem to have the same issue too. I have therefore decided to find out how I can solve this issue using data.




The data


Till date, I have been able to obtain ownership information and playtime statistics of around 6 million Steam users, who has set their profile to "public". The data contains which titles each player owns, how long players have spent on these games throughout their lifetime and the past two weeks.

In order to download these users' data via Steam's web API interface, their user IDs are required. I have therefore written a Python script to crawl through user IDs by looking up friends of specified users, and another script to download their data via the API. Only Steam account users can request API access keys, and each access key can only make 100,000 calls per day.

As Steam did not provide a web API for information about the games it sell on its website, another script has been written to scrape its Steam store website. Any titles that are not contained in the list of 3,300 games are excluded from my dataset, e.g. titles classified as Demos, Downloadable Content, Mods, Packs, Software etc. All Early Access games are therefore not in scope of this project.

To compare the effectiveness of my algorithms against Steam's current algorithm, I have also scraped the list of 12 suggested games that Steam's recommendations engine currently produces for each of the 3,300 games. Please see next section for more details on evaluation criteria.




Evaluation criteria*

(*see Other Notes for further details)


It is easy to remove recommendations that players already own, but it is challenging to evaluate whether the new algorithm is better than the existing one.

Since this website originates from the final project for attendance at Data Science Retreat (a 3-month intensive course in data science in Berlin), and since this project is entirely self funded, online experiments are considered infeasible due to resource and time constraints; qualitative results from focus groups do not fit our purpose, as an objective quantitative measure is required in this case.

By elimination, offline studies are most appropriate in this situation. Since the user data does not contain any game ratings, Steam's recommendation system is considered "content based". According to G. Shani and A. Gunawardana in their research paper entitled "Evaluating Recommendation Systems" , the best measure to use in this case is "Precision at N", where N = 12 since Steam's recommendation system only produce 12 results for each game, and their list of recommendations provide a basis for comparing effectiveness between different algorithms.

What is "Precision at N=12"? Put simply, it tries to estimate in percentage, how many of the top 12 recommendations on average are considered relevant to users. Using part of the data as unseen test set, and assuming users in the test set are interested in the games they already bought, each recommendation algorithm will provide 12 suggestions per user for each game they own. We obtain "Precision at N" by comparing how many of these suggestions match the titles that people actually own in the test set.

The dataset has been divided into training, test and holdout sets. To prevent bias, users with user id number ending in 4 or 9 goes to the test set, and user ids ending in 5 or 0 belongs to the holdout set. This means that 60% of the data (around 3.5 million users) remains in the training set.

Based on the above methodology, Steam's own recommendation system has a "Precision at N=12" score of 27% for both test and holdout sets. Will other algorithms beat this?




Algorithms tested

Here is one thing I have learnt in DSR:

“If you are not embarrassed by the first version of your product, you’ve launched too late.” – Reid Hoffman

Till date, since it is a final project within DSR with a project timeframe of 5 weeks, I have chosen to test two of the recommendation algorithms which are considered quicker to build, less computationally expensive with higher scalability, built entirely using Python and techniques such as sparse matrix multiplications. In the later sections, it has been found that both of them has brought significant improvements to the existing Steam recommendation engine. The effectiveness of other more complex algorithms can be explored after the first version.


Pearson R Correlation


The correlation between the time spent on one game vs time spent on another by each user has been calculated. 4 versions of the Pearson R correlation has been tested:

1. Correlation based on the time spent on each game by each user throughout their lifetime.

2. Same as the above, excluding results where there are less than 5000 users owning both game X and game Y.

3. Correlation based on the time spent on each game by each user only in the past two weeks (when data is downloaded).

4. Same as the above, excluding results where there are less than 5000 users owning both game X and game Y.


Log-Likelihood Ratio


Also integrated within Apache Mahout to calculate item similarity scores, LogLikelihood Ratio is a helpful measure to judge which co-occurences are "anomalous" enough to suggest there is a strong relationship; in other words, the more the number of users that owns both game A and B, the more we cannot consider a mere coincidence that people are owning both games together, and LLR attempts to quantify this relationship with a measure.


Game A Users that own games other than A
Game B Owning games A and B together (k_11) Users that own B, but not A (k_12)
Users that own games other than B Users that own A without B (k_21) Users that own neither(k_22)

LLR will turn the statistics in the above table with a score to measure the strength of the relationship between games A and B. In the following example, it will flag up the scenario in the bottom right hand corner:


Log Likelihood Ratio demonstration

Ted Dunning has explained in his blog in details how this score is computed.




Results


Steam's own recommendation system has a "Precision at N=12" score of 27% for the test set. Both LLR and Pearson R correlation beats Steam's recommendation system if the right parameters are specified.

LLR algorithms, either based on all games that users own or games that users have played in the last 2 weeks, have precision scores of 44% and 39% respectively.

In the meantime, Steam's recommendation system beats most of the Pearson R algorithms by 4% or more. However, after filtering out combinations of games where there are less than 5000 players in the past two weeks, the performance of Pearson R correlation has drastically improved from 17% to 44%, which slightly outperformed both of the LLR algorithms when decimal places are considered.

By mixing the scores of the two LLR algorithms and the Pearson R correlation based on past two weeks' playtime statistics, it improves the Prevision at N=12. Different weights have been assigned and tested, and the best weights optimises results at 45%. This is a whooping 18% difference over Steam's own recommendation system.

Results have been validated against the holdout set, and the differences between the Precision at N scores from the test and the holdout sets are found to be negligible.

In short, the algorithm on this website takes into account users' game ownership patterns and playtime statistics of the past two weeks.



Further work


The methodology evaluates recommendation lists based on a single game that user own, but it fails to evaluate recommendation lists based on users owning more than 1 title. At this moment, this website provides recommendations simply by aggregating the scores of each recommendation by each title that users own. If a user owns all 9 titles from the Half Life series and all 4 titles from Counter Strike Titles, the recommendations that this site generates will be skewed towards Half Life, even though the user may personally prefer Counter Strike over Half Life. Online studies may be required to test out different approaches in aggregating the results.

Effectiveness of other recommendation algorithms are to be explored at a later date.

As the recommendation system partly depends on user playtime statistics in the past two weeks, a script is still yet to be written to update this website every 2 weeks.




Other Notes


1. There are some issues using "Precision at N=12" directly. For instance, if a Steam user only owns 4 games, even if the recommendation engine recommends all the right games, the average precision at N will be only (4-1) / 12 = 25%. Adjustments have therefore been made to the "Precision at N" measure, so that it will take into account how many games users own, i.e. the "Precision at N" measure in the example will now become 100%.

2. Some 300 games have been excluded from the comparison because their recommendation lists contains Early Access and Demo games, which are considered out of scope in this project.

3. Games, on which users spent less than 30 mins, are excluded from the calculations of Pearson R correlation based on past two weeks of playtime statistics. This has found to improve the performance of the algorithm. Further work is required to optimise performance by modifying this parameter.




Powered by Steam. This site is not affiliated with Valve, Steam, or any of their partners.
All copyrights reserved to their respective owners.
© Kevin Wong 2014. Background images from Deus Ex: Human Revolution.