This article is divided into three parts. In the first part, I will provide a practical understanding of Armaskill and define the scope and goals of my further analysis. In part two, Armaskill and other statistics will be analyzed. In part three, I will introduce and explain a new recreation of Armaskill, which I have of course named after myself (always name new things after yourself, they might catch on someday!). It’s a long article. I’ve kept it light on the math. Enjoy.
Part One: Introduction
dlh maintains a set of statistics for several servers at Gridstats.com, and a couple of years ago added a player evaluation system that Microsoft developed for its Xbox Live system. Trueskill or, for our purposes, Armaskill is in its most basic form an adaptation of an Elo rating to games with more than two players. In a fortress match, each player enters with an Armaskill rating. If they win the match, it goes up. If they lose, it goes down. The collective weight of their opponents’ initial ratings, compared to their own team’s, determines how much it goes up or down. If a bad team loses to a good one, the Armaskill ratings of both teams’ members do not change all that much, since we haven’t learned anything new.
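To make the "we haven’t learned anything new" point concrete, here is a minimal sketch of the plain two-sided Elo update that Armaskill builds on. This is not the Trueskill algorithm itself. The K factor of 32, the 400-point scale, and the ratings are conventional or made-up values, and treating each team as a single rating is a simplification I am assuming for illustration.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expectation: a value between 0 and 1 for the first side."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def elo_update(rating: float, opponent_rating: float, won: bool, k: float = 32.0) -> float:
    """Move the rating by k times the gap between the result and the expectation."""
    actual = 1.0 if won else 0.0
    return rating + k * (actual - expected_score(rating, opponent_rating))


# A weak side (1200) losing to a strong side (1600) was expected, so little changes.
after_expected_loss = elo_update(1200, 1600, won=False)

# The same weak side winning is a surprise, so the rating jumps.
after_upset_win = elo_update(1200, 1600, won=True)
```

The expected loss costs about 3 points, while the upset win is worth about 29.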

Figure 1a: In this simplification, each match performance is graded from 1-9, and then stacked according to its frequency. The average performance will be where the graph peaks.

Figure 1b: Smoothed over many matches, the distribution approaches some function of the mean and the standard deviation
Armaskill isn’t quite as simple as tracking a single rating, as Elo does. A further improvement Microsoft made was to represent players with two numbers, rather than just one. A player is represented by their mean skill and by a standard deviation [Figure 1a, 1b]. Each individual performance is thought to be an independent random variable, and so the collection of these individual performances over a large number of matches should fall into some normal distribution. The Armaskill rating incorporates both the mean and the standard deviation, and for the purposes of this study, only the rating is important.
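One common way to fold the two numbers into a single rating, and the one Microsoft describes for Trueskill leaderboards, is a conservative estimate: the mean minus three standard deviations. Whether Gridstats computes the Armaskill rating exactly this way is an assumption on my part, so treat this as a sketch of the idea rather than the site’s formula.

```python
def conservative_rating(mean_skill: float, std_dev: float, k: float = 3.0) -> float:
    """Mean minus k standard deviations: the less certain we are, the lower the rating.

    Assumption: k = 3 follows the usual Trueskill convention; Gridstats may differ.
    """
    return mean_skill - k * std_dev


# Trueskill's usual starting values: a brand new player rates at zero.
new_player = conservative_rating(25.0, 25.0 / 3.0)

# Same mean skill, but one player is a much better-known quantity.
well_known = conservative_rating(30.0, 1.5)
little_known = conservative_rating(30.0, 6.0)
```

Two players with the same mean end up with different ratings until the system has seen enough matches to narrow the spread.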
[A note on standard deviation: For the practical purposes of this article, think of standard deviation as representing the spread of a statistic. The higher the number, the more broadly spread the thing is. A standard deviation of 0 would be like tossing a one-sided coin. I’ll often normalize standard deviations as a percentage of the mean. This just means we can compare the spread across different types of statistics.]
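A minimal sketch of that normalization, assuming the population standard deviation is the one being used (the article does not say whether it is the population or the sample version):

```python
from statistics import mean, pstdev


def spread_as_percent_of_mean(values):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * pstdev(values) / mean(values)


# The one-sided coin: every toss is identical, so there is no spread at all.
one_sided_coin = spread_as_percent_of_mean([1, 1, 1, 1])

# Made-up values with a mean of 20 and a visible spread around it.
spread_example = spread_as_percent_of_mean([10, 20, 30])
```

Because the result is a percentage, a spread in kills per round can be set side by side with a spread in ratings.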
It is important to understand what events in a match factor into Armaskill. The only event that Armaskill cares about is the match outcome, win or lose. Nothing else that happens is taken into account. An individual player could die every round; all Armaskill knows is whether his team wins.
Essentially, Armaskill is a black box. It has inputs into a match, and it gets outputs, but it can’t see into the box. It can’t see into a match and describe anything about what is happening there. Over the course of many many matches, this black box works very well. The players involved in each match are the variables, and for each player’s rating, he himself is the constant, while all his teammates and opponents change. Holding him constant, we get a good sense of how good he is based on the effect he has on a match. It’s not perfect, but it’s pretty good.
In fact, it’s the best we can get without making value judgments. The nice thing about Armaskill is that all it cares about is match outcomes. If we were to try to improve on Armaskill, we would have to use our understanding of fortress to place value on various events. How much is a core dump worth? How bad is it to be killed? What about cutting? What about holing?
In the first place, a lot of these events don’t show up in the statistical record. Even if they do, there’s little context. Killing a very bad opponent doesn’t mean as much as killing a very good one. Getting killed after the round ends probably isn’t as bad as getting killed during it. You could think of a lot of these things. They all make it hard to make an assumption about the value of certain events.
But the problem is that we care about those events because that’s how we as players understand the game. We keep an inner count of how often we are dying or killing or doing all sorts of other things. Our judgment of our skill is based on how well we perform on a round-by-round basis, and, in fortress, we tend to ignore the match outcome in judging how good the players in the match are. We are much more likely to look at individual scores on the scoreboard than the team ones.
And those scores are completely ignored by Armaskill. There is a disconnect between how good Armaskill says we are and our daily, intuitive judgments. The logic behind Armaskill is sound, and we should accept its ratings as an accurate portrayal of players’ performances in casual fortress. That is an important caveat, because not all players play with the same focus in casual games. If Armaskill is skewed or flawed, it is because of how seriously players take those matches. That flaw extends to kill and death statistics as well, and it shows up on our in-game scoreboard. Unless you decide for yourself how seriously a given player takes casual fortress, it is impossible to use Armaskill or any other statistic to judge if that player is better than any other.
In this study, we are not trying to judge players against one another, because, as I said above, that’s impossible without making assumptions. For the purposes of this study, we are going to ignore the effect of that flaw. We are going to take the data set as an accurate portrayal of performance, which it is, rather than of ability.
The goal of this study is to get inside Armaskill’s black box. By the end of it, we will be able to say something about the value of kills, dying, and the accuracy of the scoreboard at the end of matches. The goal is to create a statistic that ignores match outcomes and that closely resembles Armaskill. Compiling other statistics, we want to recreate the outcome Armaskill gives us as closely as possible. That process will tell us about the value of the statistics we use.
Part Two: The Data Set and Initial Observations
In this section, we will define and analyze the data set and then proceed to analyze Armaskill’s behavior in that data set.
Picking a data set
Gridstats keeps data on several servers. For this study, we want a good set of data. A good set will have a large sample size of matches and a large number of players represented. It will also represent a relatively short timespan. Players change over time, and so a server that stretched over years is asserting a comparison between two players who may have never played each other and had only a few mutual teammates and opponents. We want to limit those types of players.
Another criterion is having as much consistency as possible with regard to the number of players involved in a match. Ladle statistics or pickup matches are the ideal for this, but the sample size for either is far too small. There are only three players with more than 500 matches played in pickup, and the Ladle statistics cover just two Ladles’ worth. Sample size requirements mean that we will have to go with one of the casual servers.
Of the four choices, G5’s Mega Fortress Pro is by far the best. Of the four, it has the most players who have played more than 500 matches, with 169. Fortress Cafe, which represents a very long timespan, only has 94 such players, despite representing about 10,000 more total matches. Both MB53’s and DS’s servers have under 40 of these players.
We need to inspect the average number of players in a match. We don’t want a data set that represents 3v3 fortress, or 13v13. Gridstats does not currently present information on match participants, but it’s fairly simple to calculate. Taking those 169 players’ statistics, we can calculate the average number of opponents using the calculation below.
Average number of opponents = Hunter Average / Percent Enemies Hunted
Expanding this, we see that it gives us number of opponents.
( core dumps / round ) / ( core dumps / number of enemies )
= ( core dumps / round ) * ( number of enemies / core dumps )
= ( core dumps * number of enemies ) / ( core dumps * round )
= number of enemies / round
The core dumps cancel each other.
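The same calculation as a small sketch. The function and parameter names are mine, and I am assuming Percent Enemies Hunted is supplied as a fraction (0.25 rather than 25); the example figures are made up.

```python
def average_opponents(hunter_average: float, percent_enemies_hunted: float) -> float:
    """Opponents per round = (core dumps per round) / (core dumps per enemy faced).

    percent_enemies_hunted is assumed to be a fraction between 0 and 1.
    """
    if percent_enemies_hunted <= 0:
        raise ValueError("a player with no core dumps gives no way to infer opponents")
    return hunter_average / percent_enemies_hunted


# Hypothetical player: 1.5 core dumps per round while hunting 25% of the enemies faced.
opponents = average_opponents(1.5, 0.25)  # 6.0 opponents per round
```

A player who has never recorded a core dump has to be skipped, since both inputs are zero and the ratio is undefined.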
We see then that the average number of opponents is 5.82, with a high of 7.26 (Lackadaisical) and a low of 3.42 (aussie@forums, makes sense). Calculating the quartiles, we see that half of the players have between 5.45 and 6.25 average opponents. This looks like a fairly good data set. Two graphs of this information are below.

Graph 1: Number of opponents rounded to the tenths. One can see that only 6 players averaged below 4.8 opponents, making it truly an outlier condition. This is a fairly good data set.
Here’s another way of visualizing the same data.

Graph 2: Here is the same data, sorted and lined up.
The relative consistency of this notwithstanding, it is still important to take account of how the number of opponents might affect other statistics we care about. Some players have on average two more opponents than others. The high value is more than twice the low. If the average number of opponents has a significant effect on other stats, it might be something we need to adjust for. We’ll return to it when we start building our new statistic.
Analyzing Armaskill
First let’s look at the top 25 in this population. After this, we will ignore names, but it will be interesting to come back and compare this list with the one our new stat creates at the end. Remember that this server is a couple of years old.
1 - newbÎe (newbie@forums)
2 - -*insa*- (-*inS*-@forums)
3 - ~*mkay*~ (Mkay1@unk.me)
4 - ct|dreadlord (DreadLord@ct/junior)
5 - Lackadaisical (Lackadaisical@forums)
6 - madmax (madmax@forums)
7 - free kill (dlh@generalconsumption.org)
8 - Concord (Concord@forums)
9 - noob13 (noob13@forums)
10 - 75 (7575757575@forums)
11 - koala (Pre@forums)
12 - ~*¤kült¤*~ (helllo@forums)
13 - fingerbib (fingerbib@forums)
14 - Luzifer (Lacrymosa@forums)
15 - ¶Potter (ppotter@aagid)
16 - slash (slash@ct/public)
17 - ct_Cronix (Cronix@ct/junior)
18 - G5 (G5@forums)
19 - slash (slash@unk.me)
20 - <^v{}v^> (vov@forums)
21 - teen (teen@ct/public)
22 - .×] Hoax (Hoax@forums)
23 - ct|Puuquie (Puuquie@ct/senior)
24 - CTxGonzap (Gonzap@ct/leader)
25 - Syllabear (Syllabear@forums)
The first thing to look at is how Armaskill is distributed among the population of players. We would expect it to fit a normal distribution, a bell-shaped curve much like the one I drew in the introduction. I took all the Armaskill ratings of the 169 players and rounded them to the ones place, and graphed the frequency of a rating occurring. That graph is below.
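The counting behind that graph can be sketched in a few lines. The ratings here are made up, and I am assuming ordinary round-half-up rounding (Python’s built-in round sends exact halves to the nearest even number, which would shift a few players between bars).

```python
from collections import Counter
from math import floor


def rating_histogram(ratings):
    """Count how many players land on each whole-number rating."""
    # floor(r + 0.5) rounds halves upward, unlike the built-in round().
    return Counter(floor(r + 0.5) for r in ratings)


# Hypothetical ratings, not taken from Gridstats.
histogram = rating_histogram([4.6, 5.2, 5.4, 16.0, 23.5])
```

Plotting the counts against the rounded ratings gives the frequency graph.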

Graph 3: Not quite a bell curve
In a general sense it looks like a bell curve, but upon inspection, we can see it really is not. What we would expect is for values to be more smoothly distributed. It is surprising, for example, that only one player has a rating of 19 or 20, bordered by 4 players with ratings of 18 and 21. What we observe is a sort of clumping around certain values: 5, 16, 24, 39. This suggests a sort of tiering is going on. There may be pockets of players that usually play together, and have slightly limited interaction with other groups of players. This makes sense, since certain groups of players play at certain times of day. It also stands to reason that certain players only play if they see other specific players in the server. It would seem that the best way to improve your Armaskill rating would be to only play in matches with high rated players.
This, then, is a significant flaw in Armaskill. A rating is contextualized by a player’s competition; if that competition is not completely homogeneous across the entire population, then some sort of tiering or clumping, as we observe above, may occur. The graph above suggests four different groups of players, with mean ratings of 5, 16, 24, and 39. These groups intersect, certainly, but it is not clear that those numbers are accurate. The degree of heterogeneity of each population may not insulate it from comparison with the population at large.
Nonetheless, the logic behind Armaskill is sound. Some other things are also worth checking empirically. Armaskill should not depend on any statistic other than winning matches. We expect a strong correlation between Armaskill and round winning percentage, as well as statistics that capture team performance in rounds and matches. That’s all fine, and expected. What we don’t want and what we need to check for is correlation between Armaskill and non-performance based statistics, like number of matches played and number of opponents. A correlation between Armaskill and either of those items would suggest you can improve your Armaskill simply by playing more (or less) or simply by playing in more crowded (or emptier) fortress games.

Graph 5, Conclusion: No correlation between matches played and Armaskill
These two graphs compare our two suspects against Armaskill, and show them both to be innocent. There is no meaningful correlation between either and Armaskill, which is good. A correlation coefficient is a statistic that describes how closely two data sets move together. A coefficient of 1 would be perfect correlation; 0 is none. For matches played and number of opponents, the coefficients were 0.25 and 0.26 respectively. Those are very weak correlations: squaring them shows that each accounts for only about 6 to 7 percent of the variation in ratings. Furthermore, we can expect that a very bad player wouldn’t stay very bad if he played a lot of matches. The graph shows that for players playing more than around 2000 matches, a skill floor is created. You play enough, you’re certain to have a minimum degree of skill. As a general rule, if you cannot see any correlation in the graph, there isn’t any worth noting.
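For anyone who wants to reproduce these numbers, here is a minimal sketch of a correlation coefficient. I am assuming the Pearson coefficient, which is what a spreadsheet’s correlation function computes; the article does not name the variant, and the sample values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two lists move together, and how much each moves on its own.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


# Made-up data: one pair rises together, the other moves in opposite directions.
rising_together = pearson([1, 2, 3, 4], [2, 4, 6, 8])
moving_apart = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

The first pair comes out at 1 and the second at -1. The sign gives the direction of the relationship and the absolute value gives its strength.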
There aren’t any adjustments we need to make to Armaskill; it’s sound. The average rating is 24.8, and half of all players are rated between 13.1 and 35.8. The standard deviation is 16.0, as a percentage of the mean, 64.5%. So let’s get cracking.
Part Three: Recapturing Armaskill
In this final section, we get to start playing around with things. The first thing to be done is to find which statistics will be the most helpful in creating Concordance, our new Armaskill. Then we must determine how to weight them.
The rules
- No black-box statistics can be used. These include round wins, match wins, and anything about team scores. These all say something about the team, and less about the individual.
- Armaskill itself cannot be readjusted. We determined in part two it was valid as is.
- Success will be evaluated by calculating the correlation coefficient of the new statistic against Armaskill. The better the correlation we can get, the more we have succeeded.
Finding the statistics that matter
The simplest way to do this is simply to calculate the correlation between all the numbers and Armaskill. This will help us see what numbers are already closest to resembling Armaskill. In reality, I did this. For the sake of the article, we’ll pretend I didn’t, and we had to figure it out by inspection. This will give opportunity to see some other interesting things.
The natural starting place is to look at kills and deaths. We would expect this to form the core of our recreation. The frequency with which a player kills and dies seems to be directly connected to their skill. Simply enough, I graphed kills per round against deaths per round.

Graph 6: A loose but definite correlation
Here we see there is a definite, if loose, correlation between how often you die and how often you kill. Generally, the players who die the least frequently also kill the most. This might be because they are better, or it might be simply that, being alive longer, they have greater opportunity to kill. They might play defense, a good position for generating kills and protecting yourself from getting killed. Across the whole population, obviously, players kill once for every death, since all deaths and kills must be accounted for. In practice, since we have only taken players with more than 500 matches played, the numbers might not line up perfectly, but with an overall kills/death figure of 1.02, we see confirmation of what we would expect. The correlation coefficient is -0.48, suggesting a definite connection between the two. What is not revealed is the importance of each one. Do kills come as a result of being alive, or is surviving more important to success?
Comparing each to Armaskill should tell us which is more meaningful.

Graph 7: Correlation, but how strong?
Each clearly correlates to Armaskill, but it’s unclear from the picture how strong that correlation is, especially in the case of death rate (in green). We can see a general trend of high rated players dying less and killing more, but there are a number of outliers. In fact, death rate only has a correlation of 0.56, while kill or hunt rate has a rather strong 0.88. This means that if we simply stopped here, and used kill rate as Concordance, we’d get pretty close most of the time. Whatever we come up with is going to have to improve on 0.88.
The degree of randomness in death rate is somewhat surprising. In Ladle fortress we know that surviving is critical to winning rounds. That doesn’t seem to be the case in casual fortress. For our population, 24.6% of rounds ended in a 1v1. That means that at the end of the round, on average 11 of 12 players had died, and thus death had almost no bearing on the round’s outcome. It plays into the point differential, though, and that factors into match outcomes. Armaskill recognizes it as a correlative, just not a very strong one.
Two other intriguing numbers are situational percentages. A player’s 1v1 ability should reflect their general skill, as should their 2v2 ability. When we compare these to Armaskill, we see a decently strong correlation.

Graph 8: Two good correlations
Neither is as closely correlated as hunt rate, but they are both fairly good. 2v2 is a bit better, since almost every player faces more 2v2 situations than 1v1 situations. Those situations also happen to say less about the individual player himself, because a teammate is present. Continuing this logic, we should expect that 1v2 situations say a lot about the individual, while 2v1 says very little at all. It will be important to figure out how to weight each, but they’re definitely being included in Concordance.
The final two things that seem obvious to look at are eyeball tests. Zeroing occurs when a player doesn’t kill anyone in a round, and so zero percentage says something about a player’s consistency. It also reflects how we intuitively understand skill. We assume that the players whose names we see on the console killing people every round are better. We want to find out the accuracy of that eyeball test. The other eyeball test is match high scores. Are the people who score the most points each match really the best? We want to test that theory as well.
Again, we graph against Armaskill.

Graph 9: Strong correlations, very strong
These are more closely correlated than the situational stats, and hit at roughly the same levels as hunt rate. Zero rate is the best, with a correlation coefficient of 0.89, while match high score is at 0.83. (Zero rate is actually -0.89, a negative relationship; the absolute value represents the strength of the correlation.) This means that for eyeball tests, the match scoreboard doesn’t really lie, but the real truth is in round by round results. A binge round where someone gets three kills says less than consistent scoring. Regardless, both ingredients will find their way into our recipe.
At this point, it’s worth recapping our findings. The best candidates for inclusion in our new statistic are hunt rate, zero rate, match high score rate, and situational win percentages. Just to make sure we didn’t miss anything, a graph of each stat’s correlation coefficient to Armaskill is below.

Graph 10: 0.89 is the number to beat
Here we see our competition. We cannot use round or match win percentage in our calculation, but we want something that will correlate just as closely as they do to Armaskill. Match win percentage comes in at 0.911, and round win percentage is at 0.926. Whatever we do, it needs to improve on Zero rate’s 0.89 correlation. Our number should give a more accurate recreation of Armaskill than simply using one of the raw stats we are using to create our statistic. Suicide, team kill and zone statistics had correlations no better than random.
Determining how to count them
It’s clear that there are going to be three main pieces of my new statistic, which I have called Concordance. The first is some combination of the situation statistics, the second is match high score rate, the third is zero rate. Hunt rate and zero rate have some overlap. We know that because of what the statistics mean, and we can see it intuitively in their almost identical correlation coefficients. I will have to figure out how to appropriately discount hunt rate from zero rate. We will also include death rate, but adjusted to take into account its randomness.
Before weighting the numbers, it is worth investigating if any of them need to be corrected for the number of opponents. As it turns out, there is a slight but notable correlation between hunt rate and number of opponents, and so adjusting for this should slightly improve the number. Adjusted hunt rate (hunt+) is calculated by the method below.
hunt+ = hunt rate / ( average number of opponents / overall population average number of opponents)
It is only the slightest of adjustments, but it does improve our accuracy, and every bit counts.
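As a sketch, the adjustment looks like this. The 5.82 population average and the 7.26 high come from part two; the hunt rate of 1.0 is a made-up value.

```python
def hunt_plus(hunt_rate: float, avg_opponents: float, population_avg_opponents: float) -> float:
    """Scale hunt rate down for crowded games and up for emptier ones."""
    return hunt_rate / (avg_opponents / population_avg_opponents)


# Hypothetical hunt rate of 1.0 for a player who averages 7.26 opponents.
crowded = hunt_plus(1.0, 7.26, 5.82)  # roughly 0.80

# A player who faces exactly the population average is left unchanged.
typical = hunt_plus(1.0, 5.82, 5.82)
```

The further a player’s usual game size sits from 5.82, the larger the correction.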
I also combined the situation statistics, throwing out 2v1, and weighted each by its likelihood of occurring. It makes sense for certain abilities to count more, based on that situation occurring more frequently. For example, the 1v1 element of your situational win percentage is calculated by the following:
1v1 win percentage * ( number of 1v1 situations / total number of situations ) = 1v1 win percentage * 1v1 situation percentage
This is done for 1v1, 1v2, and 2v2. The three figures are then summed. A graph comparing the new situational win percentage and the old 1v1 and 2v2 percentages is below.
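A sketch of the summation, with made-up win and situation counts. I am assuming "total number of situations" counts only the three situations that are kept; if 2v1 situations were included in the total, the result would come out smaller.

```python
def situational_win_percentage(records):
    """records maps a situation name to (wins, times the situation occurred)."""
    total_situations = sum(occurred for _, occurred in records.values())
    if total_situations == 0:
        return 0.0
    # Each term is: win percentage in the situation * share of all situations.
    return sum(
        (wins / occurred) * (occurred / total_situations)
        for wins, occurred in records.values()
        if occurred > 0
    )


# Hypothetical player, not taken from Gridstats.
player_record = {"1v1": (30, 60), "1v2": (5, 40), "2v2": (50, 100)}
combined = situational_win_percentage(player_record)  # 85 wins in 200 situations
```

Under that assumption, each occurrence count cancels within its own term, so the sum is simply total wins divided by total situations.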

Graph 11: Situational win percentage
Combining and weighting them creates a figure more closely correlated to Armaskill, at 0.84.
Adding them all up
For each number we include, we will weight it by its correlation. The better a number correlates, the more it will count towards the sum. Furthermore, since we are comparing different types of statistics, each figure will be normalized to the overall population mean. Lastly, each value will be given additional weight to improve correlation. These weights are what are interesting about the project. They tell us how much to value various events. Rather than choosing their value as our first step, we are setting it solely based on how it will improve correlation. By maximizing correlation, we are solving for the various weights.
Statistic            Correlation Coefficient    Weight
Zero Rate            -0.888                     -5.222
Hunt+                 0.720                     -1.172
Death Rate           -0.556                     -1.191
Situational Rate      0.843                      3.484
Match High Score      0.823                      2.013
Round High Score      0.718                      1.077
There are a couple of conclusions that come immediately to mind. Round High Score is worth about half of Match High Score. Zero Rate, discounted with Hunt+, is worth just a bit more than the situational rate. This could be interpreted as meaning that end of game performance is worth around the same as consistency throughout rounds, but a clear conclusion escapes me at the moment. It is notable that Death Rate shows up at about the same weight as Hunt+ is discounted. The weight is the correlation coefficient divided by the mean times some constant. The constants are set by maximizing the correlation coefficient of Concordance against Armaskill. An individual player’s Concordance is each weight multiplied by their statistic in that category, summed.
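Here is a sketch of that recipe. The two correlation coefficients are the ones from the table, but the population means, the constants, and the player are made up, and I am reading "divided by the mean times some constant" as constant * correlation / mean.

```python
def stat_weight(correlation: float, population_mean: float, constant: float) -> float:
    """Weight for one statistic (assumed reading of the article's description)."""
    return constant * correlation / population_mean


def concordance(player_stats, weights):
    """Each weight multiplied by the player's statistic in that category, summed."""
    return sum(weights[name] * player_stats[name] for name in weights)


# Hypothetical means and constants, for illustration only.
weights = {
    "zero_rate": stat_weight(-0.888, population_mean=0.5, constant=1.0),
    "situational_rate": stat_weight(0.843, population_mean=0.4, constant=1.0),
}

# A made-up player who sits exactly on the population mean in both categories.
average_player = {"zero_rate": 0.5, "situational_rate": 0.4}
score = concordance(average_player, weights)
```

Dividing by the mean is what lets a rate near 0.5 and a rate near 0.4 be added together: each statistic becomes a multiple of what the typical player posts before the correlation and the constant scale it.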
The correlation coefficient of Concordance is 0.9301, which is just a bit better than Round Winning percentage, at 0.9256. Concordance, normalized to have the same average as Armaskill, is graphed against Armaskill below.

Graph 12: Success! Correlation coefficient of 0.9301 beats all comers!
If we look at our new top 25, we see some changes. (Remember, again, this data is a couple years old)
1 - -*insa*- (-*inS*-@forums) + 1
2 - newbÎe (newbie@forums) - 1
3 - ct|dreadlord (DreadLord@ct/junior) + 1
4 - madmax (madmax@forums) + 2
5 - free kill (dlh@generalconsumption.org) + 2
6 - Concord (Concord@forums) + 2
7 - Luzifer (Lacrymosa@forums) + 7
8 - teen (teen@ct/public) + 13
9 - slash (slash@unk.me) + 10
10 - 75 (7575757575@forums) 0
11 - ¶Potter (ppotter@aagid) + 4
12 - ~*viper*~ (viper1@forums) + 17
13 - Lackadaisical (Lackadaisical@forums) - 8
14 - slash (slash@ct/public) + 2
15 - <^v{}v^> (vov@forums) + 5
16 - noob13 (noob13@forums) - 7
17 - koala (Pre@forums) - 6
18 - ct_Xyron (Xyron@ct/junior) + 13
19 - _~R~_Luffy (Monkey.D.Luffy@forums) + 18
20 - ~*mkay*~ (Mkay1@unk.me) - 17
21 - esspeenuubee (Fort.nub@forums) + 11
22 - Syllabear (syllabear@forums) + 3
23 - 0ma (0ma@forums) + 26
24 - CTxGonzap (Gonzap@ct/leader) 0
25 - CtxWoned (owned@forums) + 20
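The movement figures beside each name are just the old rank minus the new rank. A sketch using the first few names from the two lists (the helper name is mine):

```python
def rank_changes(old_order, new_order):
    """Positive means the player moved up; None means no old rank is known."""
    old_rank = {player: rank for rank, player in enumerate(old_order, start=1)}
    changes = {}
    for new_rank, player in enumerate(new_order, start=1):
        previous = old_rank.get(player)
        changes[player] = None if previous is None else previous - new_rank
    return changes


# The top of the Armaskill list and the top of the Concordance list.
armaskill_top = ["newbie@forums", "-*inS*-@forums", "Mkay1@unk.me", "DreadLord@ct/junior"]
concordance_top = ["-*inS*-@forums", "newbie@forums", "DreadLord@ct/junior"]
changes = rank_changes(armaskill_top, concordance_top)
```

Players who come from outside the old top 25 need the full Armaskill ordering to get a number, which is why the sketch returns None when a name is missing.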
Obviously, both Concordance and Armaskill have the same flaw: how seriously did these people play? I would argue that Concordance depends less on effort than Armaskill, because the statistics it counts are not “effort” statistics. There is a lot that goes into winning besides what Concordance takes into account, but most of what it doesn’t is tactical types of plays that we can loosely associate with how much someone cares about the match. More casual players are likely to still play well in situational cases, and are likely to still try to kill people. But they might not help their team win in all the small ways that Armaskill takes into account. That’s just a guess. I’ll try to push an argument forward in another piece. For now, enjoy the graphs.
Yours Truly
