What Statistics and Machine Learning Can Teach Us About The Gameplay Changes in Heroes of the Storm
Before we start, let me introduce myself: I'm ghostDunk. I've been playing since the Kharazim patch, I've been involved in the amateur scene both casting and playing, and I have probably met a few of you. More relevant to this post, I've previously done mathematical work analyzing whether good players can carry bad players, and how much hero level matters.

Heroes of the Storm has changed quite a bit over the years. From scaling changes, to ammo and minion changes, to the infamous XP changes, the biggest changes have typically not involved hero balancing but the subtle foundations of the game itself. These changes are invisible to most people (I had someone in my Team League tell me that forts were worth 1800 XP just last week), but they are analyzed and picked apart by anyone playing the game at a competitive level. Nothing beats facts, though, and very little beats science. So let's do some science!

(You'll want to read all of it as I saved my most interesting data for last!)

Science With Hotslogs Data

Hotslogs puts out 30-day data dumps of the previous month's data (so 30 days of data starting 60 days ago). When I returned home from Blizzcon, CavalierGuest requested that we do some analysis pre- and post-XP changes, because of the very obvious debate over those changes' impact on the game. (He will have his own follow-up article using some of the data I gathered.) I was careful to capture dumps with game data from October 12 through November 11 (pre-XP), and then from December 11 through January 10 (post-XP). While doing the analysis, I realized that I had actually saved data from before the Blizzcon 2017 changes! If you've forgotten what those were, here's the blog post from the devs. They include ammo removal, minion changes, et cetera. What's more, the game has given me a natural control group between the pre- and post-XP changes in Towers of Doom, where the map functions identically between patches!

So I fired up my python editor and looked through the data, which includes map, game type, MMR for each player, all TAB screen stats for each player, as well as some more (time dead, self-heal). Let's get started with what I found out!
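(If you want to follow along, a minimal sketch of loading a dump with pandas looks something like this -- the file and column names below are placeholders, not necessarily the dump's exact schema.)

```python
import pandas as pd

# Placeholder file names -- the dump ships as a handful of CSV exports.
replays = pd.read_csv("Replays.csv")            # one row per game: map, game type, length, date
players = pd.read_csv("ReplayCharacters.csv")   # one row per player per game: hero, MMR, TAB stats

# Join the player rows onto their game so every stat line knows its map and mode.
data = players.merge(replays, on="ReplayID", how="inner")

print(data["GameMode"].value_counts())          # e.g. Quick Match vs. draft modes
```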

(Apologies to color-blind readers. It takes a long time to create these graphs, and color changes are not easy. If you can let me know which graphs are hard to read, I can edit them manually and repost them here.)

Hotslogs replay uploads have decreased drastically

The very first thing I noticed was how much...smaller the data dumps were. It's not surprising that hotslogs uploads have dwindled, seeing as how they've slowly backed away from constant communication with the community, and newer players are less likely to upload their replays. That said, I wasn't expecting it to fall quite so drastically:

...but not quite as drastically for draft modes

But don't take this data as evidence that the player base has dwindled by half. If we split total uploads into Quick Match and Draft Modes (Unranked, Hero League, and Team League), we find that draft mode uploads have not fallen quite as drastically. This means the majority of missing uploads are from Quick Match; draft mode players are more likely to still be uploading their replays.


By the way, I'm going to be splitting the data like this quite often, between Quick Match and Draft Modes. Restricting analysis to Draft Modes has the advantage of more comparable sample sizes. Machine learning requires lots of data points, and when the groups we're comparing are closer in size, the comparison is fairer.

However, we don't need machine learning to get some basic facts about how the game has changed.

XP changes made games more even, but not as even as they used to be

The first thing I looked at was the most obvious impact of the XP changes on the game -- XP levels. Part of the reason the changes were made is that people were frustrated with being so far behind that it felt impossible to come back. I looked at the level difference between the winning and losing team, and made a graph showing, by percentage, how many games ended at each level lead, for October 2018 (pre-XP changes):

The first thing we can see is that Quick Match games actually fared better than draft modes. One might expect Quick Match to simply be a clown fiesta where stomps were more likely, but instead QM games are, on the whole, more competitive than ranked/unranked games. Then, I compared it to level differences post-XP changes:

December 2018 - January 2019

Here we see that the changes did have the desired effect of lowering the number of stomps -- there are significantly fewer games with a high level lead. There is also a drop in the number of games where the team with less XP wins.

At this point in my analysis, I thought "well, they accomplished their goal at least." But then I realized I had the old 2017 data, and here's where things got interesting:
October 2017
That's right: the game had even more close matches back in 2017, before the Blizzcon 2017 changes (ammo, minions, et cetera) were introduced. The high end of the spectrum isn't quite as good, but that is a trade-off for a lot of close games and comebacks. Personally, I prefer this graph.

Yes, you read that right: Blizzard could have reached their goal of "closer games" simply by reverting the Blizzcon 2017 changes.
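(If you want to reproduce these distributions from a dump yourself, the calculation is just a normalized histogram of the winner's level minus the loser's. A sketch, with placeholder column names:)

```python
import pandas as pd

# Placeholder schema: one row per match with each team's final level.
games = pd.read_csv("games_summary.csv")

level_lead = games["winner_level"] - games["loser_level"]

# Percentage of games ending at each level lead
# (negative means the team that was behind on XP won).
distribution = level_lead.value_counts(normalize=True).sort_index() * 100
print(distribution.round(2))
```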

How does skill level affect these numbers?

Then, I split the data up into MMR bands, to see how skill level affected these numbers. I made separate graphs for Quick Match and Draft Modes.
The lower your skill level, the more likely you are to get stomped. This could very likely be because less skilled players are more prone to feeding while trying to come back into the game, instead of being patient and waiting for an opening. However, notice the 2017 data: the distribution is even across skill levels. I would argue that this is evidence that the game's win conditions were more evident to players of all skill levels on the 2017 patch. It's more evidence that what Blizzard really wanted, perhaps, was to go back to how the game felt in 2017.

Control Group

Remember when I said we had a natural control group in the form of Towers of Doom? Let's look at the same data pre-ammo, and pre- and post-XP changes, to see if they are, in fact, the same. Or perhaps the smaller number of games changes this?
 

The data looks pretty much the same here. There are some differences at the top and bottom, but because those categories have fewer games, we can expect a little statistical variance. Even taking these differences at face value, I can't come to any real conclusion based on these two charts, and my interpretation is that the level-difference changes we saw on the other maps are entirely due to the XP changes.

Is Quick Match different?






(Note: the MMR bands are different because Quick Match MMRs are differently distributed, and the extremes didn't have enough games to be meaningful.)

We do see a slight difference here -- the 2017 distribution does shift a little more with skill level, but still not nearly as much as the 2018 and 2019 numbers. But in my opinion, the biggest takeaway from these graphs is that the improvement in level leads that the XP changes were supposed to deliver is all happening at the mid to high skill levels -- the lowest skill levels remain untouched.

These changes were made to appeal to the everyday player, not the top. The fact that the most popular game mode has not improved for the lower skilled players tells me that the XP changes are not succeeding as well as they should be.

What about MMR differences?

My next step was to instead split up the data by MMR difference between each team. To calculate the MMR for a team, I used the quadratic mean of each player's MMR, something I showed in my previous research to be more accurate than the arithmetic mean, aka "average". (Does this assumption hold up for newer data? Spoiler: yes)
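(For reference, the quadratic mean is just the square root of the average of the squared MMRs. A quick sketch, with made-up numbers:)

```python
import numpy as np

def quadratic_mean(mmrs):
    """Root-mean-square of a team's MMRs; weights strong players slightly
    more than the arithmetic mean does."""
    mmrs = np.asarray(mmrs, dtype=float)
    return np.sqrt(np.mean(mmrs ** 2))

# Two made-up teams with the same arithmetic mean (2000):
print(quadratic_mean([3500, 1625, 1625, 1625, 1625]))  # ~2136 -- the "rainbow" team rates higher
print(quadratic_mean([2000, 2000, 2000, 2000, 2000]))  # 2000.0
```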

The right side is the same as the level-difference chart, but only for games where the higher-MMR team won. The left side is the level-difference chart, mirrored, for games where the higher-MMR team lost.

October 2017

October 2018

December 2018 - January 2019

Level differential by map

Here's the average level of each team on each map: the winning team's level is the blue dot, the losing team's is in red, and the bars represent one standard deviation.

October 2017

October 2018

December 2018 - January 2019


Matchmaking is actually pretty good

Ask most players what the biggest issue with Heroes of the Storm is, and you'll more often than not hear a screamed response: "MATCHMAKING!!!". Your blood may already be boiling right now, so let me clarify what I mean by "good". The primary goal of any matchmaker is to make sure that both teams have an equal chance of winning; that the odds are even, 50-50. What this doesn't cover are the dreaded "rainbow" matches, where the game feels worse because a good player has to put up with obvious mistakes from a worse player. Even though both teams have just as good a chance of winning, and even though I'm about to show that a good player in a rainbow match has the ability to carry the game, it feels bad. We play this video game because it's fun to operate as a team, and players from different MMR levels typically have a harder time getting on the same page.

But enough opinion, let's look at some facts.

Matches look as though they've gotten more lopsided when looking at Hotslogs data

This was an easy bit of analysis. I plotted the cumulative MMR difference against percentage of games. This means that the graph starts at 1 (for 100% of games), and as the graph moves to the right, it shows the percentage of games played on the left axis where the MMR difference between teams is greater than the MMR shown on the bottom axis. So in this first graph for October 2017, 20% of matches have an MMR difference of over 200 points.

Plotted along with this is the win rate of the higher MMR team, shown on the right axis. So, in this graph, in matches where the MMR difference between teams is over 200, the higher MMR team wins about 58% of the time.
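(The calculation behind these curves is straightforward. A sketch, again with placeholder column names:)

```python
import pandas as pd

# Placeholder schema: one row per game with each team's combined MMR
# and a 0/1 flag for whether team 1 won.
matches = pd.read_csv("matches.csv")

diff = (matches["team1_mmr"] - matches["team2_mmr"]).abs()
team1_is_higher = matches["team1_mmr"] >= matches["team2_mmr"]
higher_team_won = (matches["team1_won"] == 1) == team1_is_higher

for cutoff in range(0, 801, 100):
    above = diff > cutoff
    share = above.mean()                      # left axis: fraction of games above this MMR gap
    winrate = higher_team_won[above].mean()   # right axis: how often the higher-MMR team wins
    print(f"MMR diff > {cutoff:3d}: {share:5.1%} of games, higher team wins {winrate:5.1%}")
```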

October 2017 Draft Modes

October 2018 Draft Modes

December 2018 - January 2019 Draft Modes

October 2017 ALL Game Modes

October 2018 ALL Game Modes

December 2018 - January 2019 ALL Game Modes

I prefer looking at the draft mode versions of this graph, since Quick Match matchmaking also takes your hero level into account (something the devs have explicitly said). It's hard to tell much by looking at each graph individually, so I combined them.

Ideally, these graphs would look the most like an "L" shape -- more games with closer MMR levels. In this graph, Blue Solid and Yellow Dashed is October 2017, Purple Solid and Green Dashed is October 2018, and Red Solid and Blue Dashed is January 2019.

Yes, this could be clearer, but I already made the damn thing and I don't want to redo it.

Combined Dates Draft Modes

The point here is that, if you look at Hotslogs MMR, it looks like matchmaking has gotten worse. But is this a result of Blizzard's matchmaking, or of fewer people uploading their matches? When fewer people upload their matches, MMR becomes far more uncertain, especially for players the database has only seen once.

Now let's bring in the big guns: machine learning.

...but MMR is very, very bad at predicting matches

First, a quick machine learning primer

To understand the graphs I'm about to show you, you need to understand some machine learning concepts. Don't worry, they're actually pretty simple. Also, huge shout out to H2O, I couldn't have done this project without this software, and I highly encourage you to try it out if you're interested in machine learning.

Training Data and Validation Data

In order to make our machine learn anything, we need two things: a model and data. Think of the model as a mathematical equation that we will change as we learn. The data comes in the form of a series of inputs to that equation, plus the correct outputs. In our case, the only output we ever care about is whether a team won or lost. Once we have collated all our data, we feed most of it to the machine to train it. The computer takes each new piece of data, puts the inputs through its current equation, and uses the correct answers to adjust its calculations, thereby learning from the data. It's a little like repeatedly guessing at a math problem and checking the answer each time. Once the computer has gone through all of the inputs, the model -- that is, the equation that we end up with -- is a set of rules that we will use to make predictions.

However, we need to make sure the machine doesn't overlearn and invent rules so crazy they don't fit real life -- this is called overfitting -- so we save some of our data to validate our calculations. Having some data set aside for validation keeps us honest and makes sure that the model we've built actually predicts things well.
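(I did all of the modelling through the H2O Flow UI rather than in code, but the Python equivalent of "train on most of the data, validate on the rest" looks roughly like this -- the file and column names are placeholders:)

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Placeholder file: one row per game with team-level stats and the outcome.
df = h2o.import_file("training_table.csv")
df["team1_won"] = df["team1_won"].asfactor()        # tell H2O the label is win/loss, not a number

# Hold 20% of games back as validation data so we can catch overfitting.
train, valid = df.split_frame(ratios=[0.8], seed=42)

features = [c for c in df.columns if c != "team1_won"]
model = H2OGradientBoostingEstimator(ntrees=200, seed=42)
model.train(x=features, y="team1_won", training_frame=train, validation_frame=valid)

print(model.auc(valid=True))                        # performance on games it never trained on
```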

Confusion Matrix

This is just a table of how well the model performs on a set of data. Here's a confusion matrix for a binary output like win/loss:
                 Actual Win   Actual Loss
Guessed "Win"        60            9
Guessed "Loss"        4           20

And here's what the table means:
* Top left: we guessed "win" and the correct answer was "win"
* Top right: we guessed "win" but the correct answer was "loss"
* Bottom left: we guessed "loss" but the correct answer was "win"
* Bottom right: we guessed "loss" and the correct answer was "loss"
The top left and bottom right boxes are good guesses! The bottom left box is what's called a "false negative". The top right box is called a "false positive".

False Positive Ratio and True Positive Ratio

Now that we have these numbers, we want to look at two in particular: the false positive ratio and the true positive ratio. The false positive ratio is the number of false positives (top right box) divided by the total number of games where the correct answer was "loss". In this case, it would be 9/(9+20) = 9/29 = 0.3103. The true positive ratio is the number of times we guessed "win" correctly divided by the number of times the correct answer was "win". In this case, it would be 60/(60+4) = 60/64 = 0.9375.
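(In code, using the example matrix above:)

```python
# Rows of the example matrix were our guesses, columns were the truth.
tp, fp = 60, 9    # guessed "win": 60 were really wins, 9 were really losses
fn, tn = 4, 20    # guessed "loss": 4 were really wins, 20 were really losses

true_positive_ratio = tp / (tp + fn)    # 60 / 64 = 0.9375
false_positive_ratio = fp / (fp + tn)   # 9 / 29 ≈ 0.3103

print(true_positive_ratio, false_positive_ratio)
```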

Binary Classification and Thresholds

When we plug numbers into our machine, we don't just end up with "win" or "loss". We end up with a probability that a team won. The threshold is where we decide at what probability we guess "win" instead of "loss". Normally you might think to just set this at 50%, but it turns out that models almost never perform best at that threshold. To choose the best one, the computer looks at a bunch of different thresholds and calculates the confusion matrix for each of them. There are a million different ways to score a confusion matrix and pick the best one, but H2O uses the "F1" score by default. (You don't need to know what that means, just know it's a good choice.)

ROC Curve

Now that we have all those basics down, let's get to the point: an ROC curve is what you get when you create a confusion matrix for a bunch of different thresholds from 0 to 100% (0 to 1, as far as the computer is concerned), and then graph the false positive ratio against the true positive ratio.

Here's an example of one from our data:

(An example ROC curve from our MMR data)
The red line denotes the worst possible model, a model that is equal to random chance. The best possible model would hug the top-left corner -- a line going straight up and then across the top -- because a perfect model reaches a 100% true positive ratio while keeping a 0% false positive ratio at some threshold.

AUC - Area Under Curve

This is just a measurement of how much space is underneath the ROC curve. The worst possible model (random chance) will have a measure of 0.5, and the best possible will have a measure of 1.
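(Here's a small, self-contained sketch of the whole idea -- sweep thresholds, collect the two ratios, and measure the area underneath. The predictions are made-up numbers, not our data:)

```python
import numpy as np

# Made-up model output: the predicted win probability for ten games, plus the truth.
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1])

tpr, fpr = [], []
for threshold in np.linspace(1.0, 0.0, 101):   # sweep from "never guess win" to "always guess win"
    guess_win = y_prob >= threshold
    tpr.append(np.sum(guess_win & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(guess_win & (y_true == 0)) / np.sum(y_true == 0))

auc = np.trapz(tpr, fpr)                        # 0.5 ≈ random guessing, 1.0 = perfect
print(round(auc, 3))
```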

Models

So now that we know how to measure a model, what actually is a model? Well, H2O is so amazing that it can try a bunch of different models and let you know which one performed the best. The kinds of models that ended up best were algorithms based around decision trees. Here's a not-too-oversimplified version: the computer gets data in, and at each node it makes a simple decision, for example: "is x bigger or smaller than 1?" Then at the next node it asks, "is y bigger or smaller than 2?" Then it spits out a probability of whether a team won or lost depending on which node it ended up at. The computer ends up with a bunch of these nodes (my examples ended up with trees of about 100 or 200 nodes), and there are different techniques to connect them and to adjust their decisions when the answers turn out to be wrong.

Point being: this kind of model lets us look at the one the computer found best and discover which input has the most impact on the computer's eventual decision!
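(In the Python API, the "try a bunch of models and tell me which inputs mattered most" workflow is H2O's AutoML plus variable importances. Again, this is a sketch with placeholder names, not my exact Flow setup:)

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
df = h2o.import_file("training_table.csv")          # placeholder file, as before
df["team1_won"] = df["team1_won"].asfactor()
train, valid = df.split_frame(ratios=[0.8], seed=42)

# Let H2O try several model families (GBMs, random forests, etc.) and rank them.
aml = H2OAutoML(max_models=20, seed=42)
aml.train(x=[c for c in df.columns if c != "team1_won"], y="team1_won",
          training_frame=train, validation_frame=valid)

print(aml.leaderboard)                               # models ranked by predictive performance
best = aml.leader
if best.varimp() is not None:                        # tree-based leaders expose variable importance
    print(best.varimp(use_pandas=True))
```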

Now we can move on to MMR.

Finally, talking about MMR

My methodology for looking at MMR was easy: look at a team's combined MMR vs. the enemy's, and predict whether the team won. But how do you calculate a team's combined MMR? After all, maybe a good player has more influence on a team? Maybe a bad player will drag a team down? I decided to take the same approach as my previous research and use a generalized mean. You don't need to understand the math; you just need to know that you pick a number n. An n of 1 is the same as taking your standard average. The higher n goes, the more it weights the best players. The lower n goes (even negative), the more it weights the worst players. I used the October 2018 data set because I didn't have time to run it against any other data set.

I just created combined MMR scores using n from -5 to 5 and waited to see what the machine spit out.
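(The generalized mean itself, and the 11 combined-MMR features, look like this. The team's MMRs here are made up:)

```python
import numpy as np

def generalized_mean(mmrs, n):
    """Power mean of a team's MMRs. n=1 is the plain average; higher n
    weights the best players more, lower (even negative) n weights the worst."""
    mmrs = np.asarray(mmrs, dtype=float)
    if n == 0:                                       # the n=0 limit is the geometric mean
        return float(np.exp(np.mean(np.log(mmrs))))
    return float(np.mean(mmrs ** n) ** (1.0 / n))

team = [3100, 2600, 2400, 2200, 1700]                # made-up "rainbow" team
for n in range(-5, 6):
    name = f"mmr{n}" if n >= 0 else f"mmrm{-n}"      # negative n gets an "m", matching the chart labels
    print(f"{name}: {generalized_mean(team, n):7.1f}")
```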

Well.

(ROC curve for the combined-MMR model)


(Confusion matrix for the combined-MMR model)

This sucks. At the best threshold, the computer was not able to get better than 50%; if you look at the curve, it's barely better than random; and the AUC is 0.55, just barely above the worst possible value of 0.50.

Looking at the confusion matrix, I thought it humorous that all it does is guess that team 1 wins (except for sometimes), and that's the best it can do. I thought that maybe the data set might not be balanced, so I ran the machine again, this time doubling the data but inverting the MMRs and the win/loss, so team 2 got the love this time.
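(Concretely, the "doubling" step just swaps the two teams' columns and flips the label. A pandas sketch with the same placeholder schema as before:)

```python
import pandas as pd

# Placeholder schema: one row per game with both teams' combined-MMR features
# and a 0/1 label for whether team 1 won.
games = pd.read_csv("training_table.csv")

team1_cols = [c for c in games.columns if c.startswith("team1_mmr")]
swapped = games.copy()
for col in team1_cols:
    other = col.replace("team1", "team2")
    swapped[col], swapped[other] = games[other], games[col]   # swap the teams' features
swapped["team1_won"] = 1 - games["team1_won"]                 # flip the outcome to match

balanced = pd.concat([games, swapped], ignore_index=True)     # twice the rows, now symmetric
```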

Oops. We did worse.

What's the point here? This proves that matchmaking is pretty decent. It would be ridiculous to claim here that, actually, there is no difference between good and bad players, you imbecile, you fucking moron. No, obviously MMR still works, and good players are better than bad players. What's happening here is that Blizzard is making sure each team has a fair shot. The fact that we can't predict very well what will happen, even given players' MMR, says that Blizzard has already done the work of making sure the game is at least fair, even if it isn't fun. However, we can still make some predictions, and that does give us a little insight:

Better players still have a bigger impact on the game than worse players

Don't forget that I had 11 inputs to each of these. H2O tells me the relative importance of each of these inputs, so let's look at that graph for our first MMR model:

(Note: the variables are marked "mmrX" where X is the coefficient used for the generalized mean. "mX" means "negative X". So the higher X is, the more good players are considered, the lower it is, the more bad players are considered.)

Results are pretty obvious here: combining a team's MMR while weighting the best players is clearly the superior technique -- this chart shows that those values are the most predictive in this model. Might other models tell a different story, though? Well, it turns out, no. Every model I looked at showed that better players have more impact on predictions than worse players. Here's another model from this run:

The "reinforced" data I used above yielded the same results:

To test this further, I altered my data to only look at games that were more "rainbow" -- I whittled down the data set to matches where changing the coefficient mattered more. Here is the ROC curve for the best model:

Looks like it performed much better! Wonder why...

(Orange is negative.)

Here are the variable importance charts for some other models. You'll have to take my word that every model I looked at placed higher coefficients as more important.


 

XP changes have made the game less dependent on performance

That was just a warm-up for the real analysis. The goal here is to look at the actual stats from the game and see if we could determine a winner. Then we could also look at which stats were most predictive of a team's win. Here's the result from the October 2017 data set (pre-ammo changes):

(A quick note: this is a graph showing the process of the computer learning. "logloss" is an alternative measure of success that basically calculates how much our guesses suck. You can see as the computer adds more trees, our guesses suck less. The blue line is how well the model is performing against the validation data.)
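(For the curious, log loss is just the average penalty per prediction, where confident wrong guesses are punished hardest. A tiny sketch:)

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true outcomes; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(log_loss([1, 0], [0.9, 0.1]))   # confident and right: ~0.105
print(log_loss([1, 0], [0.1, 0.9]))   # confident and wrong: ~2.303
```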



It's not surprising that we are able to predict who won a match much better by looking at the results of the match, as opposed to the players' MMR before the match. Our AUC is pretty good: 0.9847.

If you ever wanted proof that damage stats are bullshit, look no further. The biggest indicator of success in October 2017 is how much XP your team has. It matters more than siege damage (which includes damage to minions) or any stat like that. Time dead far outweighs killing the other team, too.

October 2018




Once again, the results are mostly the same, except that now, in the pre-XP (but post-ammo) patch, matches are even easier to predict, mostly by looking at XP, time dead, and kills.

But what about after the XP changes...




What a difference.

The first thing that jumps out at me is how poorly the scoring did against the validation data set. The fact that it kept getting worse tells me that even the best model the computer could make was still overfitting. Also, the AUC is noticeably lower than the other models. It's still high, but we aren't as good at guessing winners anymore. Here's a chart of log loss and mean error, both calculations of how bad we are at guessing. Lower numbers are better:



Here's my thesis:

The XP changes made the game less dependent on what the players are doing for most of the game.

The fact that the model starts overfitting and has a lower score when predicting data tells me that the game is basically more random than it used to be. Time dead is now by far the most important variable in your games, which says to me that, because late-game death timers are longer, late-game team fights are pretty much the only important thing in the game now. If early-game kills mattered, or if the passive XP trickle a team now gets for destroying a fort mattered at all, XP would be more important to these calculations.

Of course, that's just my interpretation. But let me show you the result of doing this test on our control group:

Towers of Doom Only Pre-XP Changes


Towers of Doom Only Post-XP changes



The results are nearly identical.

However, I wanted to double-check whether the post-XP model was only having a harder time guessing winners because the earlier data sets had so many more blowouts (which are easy to predict). So I re-ran the tests using only games that ended one level apart or less:

October 2017 close games




October 2018 close games



December 2018 Post XP close games




To my eyes, these results look much the same, but this time the post-XP analysis compares even worse against both the pre-XP and pre-ammo data sets.

tl;dr

* Fewer people are submitting to Hotslogs, but the drop isn't quite as bad for draft modes
* XP changes fixed getting stomped, but Blizzard could have done the same thing by reverting the "More Meaningful Early Game" changes from Blizzcon 2017
* Matchmaking only sucks because you have to play with people in different MMR groups, but it doesn't make you lose
* If you're the best player on your team, you still have the most impact on winning; a good player is more likely to lift up their team than a bad player is to drag them down
* XP Changes have made the game measurably more random
* Bring back the way Heroes was in 2017, it was far more stable
* We proved this shit with motherfuckin math, yo

I'm sick of typing and I want to get this blog post out tonight so I can refer the balance devs to it tomorrow. If you have any questions, feel free to ask, either on the reddit post or here, or in the Nexus or Discord or wherever. I'm more than happy to show people source code or data sets if they would like; just know that all the ML stuff was done through the H2O Flow UI and not in code.

Thanks for reading and even if I'm pessimistic, I still love playing this game. See you in the Nexus!
Comments
Why do I think your conclusion that "MM is actually good and fair" is wrong:
For example, we have 2 teams with avg MMR = 2000. But if we take a closer look at the individual players, we have 1 team with every player at around 2k, and another team with 1 player at 3500 MMR and everyone else at 1000. To make it worse: the higher MMR player was forced to pick a support role. In raw numbers we get 2 teams with equal MMR fighting each other, with a slight difference within 200 MMR. Which team is likely to win? Actually, that doesn't matter. What matters is: this is why "everyone" bullshits MM in HotS. Avg MMR doesn't matter when it forces people to play with others who don't belong in that rank. This is why Masters-Grand Masters means no shit when you play with Diamonds anyway.
I know Oskar is probably just trolling, but it doesn't appear that he even read the article. The author didn't use the average MMR. He used the quadratic mean of each player's MMR, which does in fact weigh each individual player's MMR in calculating the team MMR. The math can be found here - https://heroesmodelling.wordpress.com/2015/09/27/how-should-individual-mmrs-be-combined-into-a-team-rating/
Summary: It is not the case that game outcomes are determined by which team has the “least bad” players. The outcome depends on all the members of the team, and is actually...
While I appreciate the effort you put into this, I'm going to strongly disagree that "If you're the best player on your team, you still have the most impact on winning; a good player is more likely to lift up their team than a bad player is to drag them down." Regardless of whether I'm a good player or a bad player, these exp changes make it extremely hard to make a comeback when your team has even one player who is playing rather poorly. When you have a player on your team who just doesn't care, is a leaver bot, or is deliberately being toxic, it doesn't matter how good you are compared to the rest of your team. The enemy team is so far ahead on kill exp that taking down a fort or a keep no longer gives you a real opportunity to make a comeback. Since you now only get passive exp over time for a fort takedown instead of earning it instantly, getting a building still leaves your team on the back foot for the entire rest of the game. Which means players will have to secure a few kills while already down a talent tier if they expect to actually catch back up to the enemy team. Playing it safe and laning won't make up for the fact that the bad player on your team has had 10 deaths in the past 12 minutes. This is the biggest beef I have with these changes. If you end up falling behind because of a bad player on your team, you are very likely to remain down for the entire rest of the match. And it's stupid.