Why reviews don’t matter, part 4
Recently we took a look at the theory and practice of videogame review scoring (if you weren't here at the time, it was even MORE exciting than it sounds), and arrived at some sensible and rational conclusions, as usual.
Not long afterwards, alert WoSblog viewer Steve Hogarty – a former editor of recently-deceased and generally well-regarded UK games mag PC Zone – sent me a link to an interesting thing.
This spreadsheet lists every mark awarded by PCZ in its entire 17-year, 224-issue history. The broad picture painted by the various graphs isn't exactly surprising – it shows an "average" score somewhere in the 70s (rather than anywhere near 50), issues where the lowest score awarded was 70%, and even a staggering example where the average score awarded in a single issue was an incredible 86% – but the detail is illuminating.
(The 86% wasn't a fluke, incidentally – there are several averages of 85% and 84% too.)
I spent a few minutes breaking down the scores, and arrived at this handy cut-out-and-keep summary:
TOTAL SCORES 0-9: 23
(a single game scored 0%, and one got 1%, in issues 17 and 96 respectively. Anyone know what they were?)
TOTAL SCORES 10-19: 104
TOTAL SCORES 20-29: 102
TOTAL SCORES 30-39: 172
TOTAL SCORES 40-49: 267
TOTAL SCORES 50-59: 365
TOTAL SCORES 60-69: 634
TOTAL SCORES 70-79: 784
TOTAL SCORES 80-89: 869
TOTAL SCORES 90-99: 389
(no games ever scored 98%, 99% or 100%. In fact, the graphs don't even go up to 100, suggesting that the mag considered the 'perfect' mark impossible to achieve.)
Immediately we can glean a few things from this info. From a total number of 3,709 reviews, just 668 scored below 50%. That's 201 fewer than were scored in the 80-89 band alone, counter-intuitively suggesting that games which were excellent verging on brilliant were substantially more common than ones that were average or worse.
(A conclusion supported by the fact that more games scored exactly 90% than are found in the 20 slots between 0% and 19% put together.)
The median game from 3,709 reviews would be the 1,855th in the list, and marks the fulcrum point around which other marks should be distributed. Rather than being a game scoring 50%, the 1,855th-placed game in the list in fact comes somewhere in the 71% band. If the PCZ review score distribution was a seesaw, there'd be a great big fat kid sat on one end.
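The band totals in the summary are enough to re-check these figures. Here's a minimal Python sketch, assuming the counts are transcribed correctly from the list above; the names `BANDS` and `band_of_rank` are made up for illustration and don't come from the spreadsheet.

```python
# Band totals as listed in the summary (lower bound of band -> number of reviews).
BANDS = {0: 23, 10: 104, 20: 102, 30: 172, 40: 267,
         50: 365, 60: 634, 70: 784, 80: 869, 90: 389}

total = sum(BANDS.values())
below_50 = sum(n for low, n in BANDS.items() if low < 50)
median_rank = (total + 1) // 2  # the middle review when the total is odd


def band_of_rank(rank):
    """Return the lower bound of the band holding the rank-th lowest score."""
    running = 0
    for low in sorted(BANDS):
        running += BANDS[low]
        if rank <= running:
            return low
    raise ValueError("rank exceeds number of reviews")


print(total, below_50, BANDS[80] - below_50, median_rank, band_of_rank(median_rank))
# 3709 668 201 1855 70
```

The band totals alone only place the median somewhere in the 70-79 band. Pinning it to 71% needs the per-score counts from the spreadsheet itself.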
The frequency chart shows an unexpected bias for "neat" scores, with big spikes at every multiple of 10 (50%, 60%, 70% etc) compared to the marks either side. A whopping 774 games have a score with a 0 on the end, more than double the 371 you'd expect if the marks in the scoring range were spread around evenly, even allowing for the skewed-upwards overall range.
This is a very weird thing to find in a mag marking out of 100 – normally you'd expect reviewers to err, consciously or otherwise, on the side of 1% either way, purely to maintain the illusion of precision. If you're going to give most games an exact multiple of 10, why not just mark out of 10 in the first place?
(PCZ didn't mark out of 10 at any point in its life, which might otherwise have explained the anomaly.)
The phenomenon is even more striking if you factor in the further 499 games which got scores ending in a 5, which means that a little over one third of all games reviewed by PCZ got multiple-of-5 marks, rather than the one-fifth you'd reasonably expect in a percentage scale.
If you're not sure there's anything noteworthy about that fact, the dropoffs from the "even" scores to the ones immediately preceding and following them are spectacular. While 75 games scored 50%, for example, just 15 got 51%, with 21 scoring 49%. Slightly further up the table 162 were thought worthy of 70%, yet just 54 were worth the extra 1% to get to 71, and a mere 66 fell short by 1% to notch 69.
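The round-number figures can be sanity-checked in a few lines. This is only a sketch, and it assumes an even spread across the ten possible final digits as the baseline, which is a simplification given that no game ever scored 98% or above.

```python
TOTAL = 3709        # reviews in the spreadsheet, per the article
ENDING_IN_0 = 774   # scores that are exact multiples of 10
ENDING_IN_5 = 499   # scores ending in a 5

expected_per_digit = TOTAL / 10            # even spread over final digits 0-9
multiples_of_5 = ENDING_IN_0 + ENDING_IN_5
share = multiples_of_5 / TOTAL

print(round(expected_per_digit))                    # 371
print(round(ENDING_IN_0 / expected_per_digit, 2))   # 2.09
print(multiples_of_5, round(share, 3))              # 1273 0.343
```

So multiples of 10 turn up a shade over twice as often as an even spread would predict, and multiples of 5 account for roughly 34% of all marks against an expected 20%.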
Once again, then, we've found that an ostensibly simple, linear, numerical system of rating games for the benefit of consumers is in fact a highly coded and abstract language requiring the reader to interpret and translate it on the basis of information which is never spoken or officially acknowledged – indeed, which is actively denied by magazines who routinely insist on their pages that "50% means average".
It's a weird way of doing things, wouldn't you say?

Doesn't account for everything, but since the PC is an open platform there's a lot more dross out there that just doesn't get reviewed. There should be a bias towards higher scores because of that selection bias.
Oh and according to Wikipedia, Newsweek 3 Globocop was the 0% game.
That’s not how it works, though. You can’t expect readers to interpret your review scores calibrated on the existence of games they’ve never heard of because you haven’t reviewed them. For scores to have any rational meaning at all, the ONLY things they can be related to are the other scores of the games you’ve reviewed.
This is probably the same over every magazine, going back more than a decade.
*goes to check old issues of CVG, NMS etc.*
…yep
Not really surprising, nor a revelation. Nobody wants to tread on the developers' toes, do they? I picked up a copy of GamesMaster (I needed to wash my hands afterwards), and the reviews for all the big name companies were high, and it looked like they were only comfortable ripping into the indie developers at the back.
I wonder why?
Amiga Power managed a lifetime average of 55%, I think. Not a bad attempt.
Now, I don't have my Worzel Gummidge maths head on, but would there be a simple procedure you could run a score through in order to re-score it according to an average being set at 50? i.e. to reset the baseline as it were? (I'm guessing there's some way out there, involving something a bit more complex than just removing 50 points from every score, although in the case of woeful games, maybe negative figures is something they'd deserve?)
Interesting seeing the figures from PCZone though, any chance of you rummaging through all the Your Sinclair reviews and doing the same?
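For what it's worth, one candidate for such a procedure is to swap each raw score for its percentile rank among all the scores the mag has awarded. What follows is a sketch under that assumption, not anything PCZ or any other magazine is known to have used; `percentile_rescore` and the sample marks are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_rescore(scores):
    """Map each distinct raw score to its mid-rank percentile (0-100).

    A score's percentile is the share of reviews below it plus half the
    share of reviews tied with it, so tied games get the same new mark.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return {
        s: 100 * (bisect_left(ordered, s) + bisect_right(ordered, s)) / (2 * n)
        for s in set(ordered)
    }


# Hypothetical marks, skewed upwards the way real ones tend to be.
sample = [40, 70, 70, 80, 90]
rescored = percentile_rescore(sample)

print(sorted(rescored.items()))  # [(40, 10.0), (70, 40.0), (80, 70.0), (90, 90.0)]
print(sum(rescored[s] for s in sample) / len(sample))  # 50.0
```

Because ties share a mid-rank, the rescored marks always average exactly 50, whatever the raw distribution looked like. The catch is that the mapping shifts every time a new review is added.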
I'd suspect the fairly human tendency to choose nice round numbers isn't as weird as you make out, and it's only if someone consciously makes an effort to "maintain the illusion of precision", as you put it, that you'd ever get 'perfect' averages.
(What was the game that scored 1%? (Whatever it was, it ought to have been H.E.D.Z. – the one time (or at least, most memorable) that I disagreed with PCZ's scores))
Well, they're in a happy minority then.
More evidence, from about 4 years ago: http://www.metafuture.com/ where they did the same kind of analysis for IGN and Gamespot scores, with very similar results.
One argument that I've seen put forward by PCGAMER was that they only reviewed games they thought people should know about, thus if they didn't think people cared about game x, which would've scored in the low band anyway, it wasn't put in. So maybe it's that the games reviewed are better than average but the stats don't say so because the less than average games aren't there.
The same still applies, though. If you're saying the games you don't review aren't worthy even of consideration, your job is still to tell the readers which of the ones you HAVE reviewed is the best, and therefore which ones they should buy.
I've had PC journo friends make the argument that there's nothing wrong with most games scoring 60%-90% if most games are good, but that's nonsense unless there's some sort of Average Game locked in a bank vault somewhere, from which fixed, unchanging empirical standards of "good" are calibrated.
(Which is, of course, an insane idea, because standards change over time, and when do you arbitrarily recalibrate the standard?)
The only purpose of having reviews at all is to rank games against their competition, because people can't afford to buy them all. If you say "they're all pretty good, really", your existence is pointless and useless.
I find that a scoring system is utterly pointless, and I'd prefer a better than/worse than system.
The text of the review should speak volumes about the quality of the game. These days it usually doesn't, as the reviewer is too busy waffling on but that SHOULD be the case.
The raw numbers also exist for AMIGA POWER, thanks to hours and hours of misery-filled wasted effort by splendour’s Brig Bother, also me. They’re in AP2’s Stats section (http://dspace.dial.pipex.com/ap2/ > Issues > Stats) with the original figs directly accessible as plain text files via http://dspace.dial.pipex.com/ap2/comments/JN/one_plus_two_plus_one.html . The method I used to ruin Brig’s tireless research and calc the stats for illustrative purposes (Total Score / Number of Things) is apparently so laughably useless in the world of sums that TO THIS DAY I am kicked on the shin in the street by four-year-olds under the direction of a primary school teacher crouching behind a wall focusing a pocket telescope, but anyone interested (ie, drunk enough) should be able to “import” the “datums” into one of these new-fangled spreading-sheets and make a graph that LOOKS LIKE A HAND FLICKING THE Vs FROM BEYOND THE GRAVE TAKE THAT THE ESTABLISHMENT.
(I’m pretty sure I have a similar list for YS somewhere, from the bewildering Every Game Reviewed Ever Starring Diana Rigg bit, but that’s hardly worth converting because it’ll cheerfully come out as an average 88.99% or something in a funky skillo sort of way.)
Ah shite. I'm sorry – my previous comment's formatting has disappeared and it's just a big chunk of nonsensical text. I'll try again. Stu – delete it if you can!
"That’s not how it works, though. You can’t expect readers to interpret your review scores calibrated on the existence of games they’ve never heard of because you haven’t reviewed them. For scores to have any rational meaning at all, the ONLY things they can be related to are the other scores of the games you’ve reviewed."
Fucking, exactly, this. Instead, it seems to me that most magazines and websites score in relation to EVERY OTHER magazine and website in existence, all of which over-score their "average" games in the 70% region, therefore creating a huge circle-jerk of broken review scoring.
Also, re the ZONE stats, it's interesting that they scored so many average games in the 70s and 80s considering that when they started out, they were quite notorious for being harsh with their scores.
Get rid of scores or use them properly.
This looks uncannily like the grading spread in US schools. Except more balanced.
"The raw numbers also exist for AMIGA POWER…"
Ooh, odd. I was sure I'd checked this before and it came out at 55%. Maybe that was only counting full-pricies or something.
But isn't this global? Most films have a 5 star rating system, which stops it, but I bet that ends up being higher on average than your suggested system (a ranking against current films). I don't think it's an issue as long as the reader understands the meaning – that a game which is competent at what it does, but not particularly acceptable, will hit 60-70%. Indeed changing that would actually be communicating badly. What matters is the reader's understanding, not meeting some metric. I actually think most people have a reasonable idea of what the metric means.
I'm not saying scoring is infallible, or even necessary, but if it must exist I don't believe that this system is any worse than an alternative.
And how is a new reader supposed to understand that 71% means 50%?
I don’t accept the argument that 4 means 2 if lots of people think it does. It means 4.
Dean Love:
"…there's a lot more dross out there that just doesn't get reviewed. There should be a bias towards higher scores because of that selection bias."
RevStu:
"That’s not how it works, though. You can’t expect readers to interpret your review scores calibrated on the existence of games they’ve never heard of because you haven’t reviewed them. For scores to have any rational meaning at all, the ONLY things they can be related to are the other scores of the games you’ve reviewed."
That kinda relies on your readership having the kind of extensive, experience-based knowledge of the games you've previously covered to be able to interpret the scores of the newly reviewed titles. So yeah, it's a great system if all your readers are full-time videogame critics who also happen to be fans of your publication – good luck selling that demographic to the venture capitalists, though.
This is why the objective metric is preferable: A rough consensus on what constitutes a good, average and poor quality game is far more easily arrived at and applied by your readership than the sort of experience-based omniscience that would *render your reviews redundant anyway*.
Speaking of omniscience, we don't have it. Hence the further from the end of a publication's life a game is reviewed, the more erroneous its relative position is likely to be in hindsight. This seems to be something your criticism of PCZ's figures completely neglects to recognise — the totality of scores is arrived at incrementally, not en masse in the final issue. Your alternative system would be just as subject to this inbuilt error as PCZ's, unless it's seriously your proposal to revisit every single score ever doled out each and every month?
What I'd hope the above points to is this: Insisting that the average score given over the course of a publication's life should equal the midpoint of the available range is a contrivance of little merit. Dump it.
The other point, that games of average quality routinely receive above average scores, is surely far more pertinent and something that should be dealt with not by resorting to insular marking schemes, but by reviewers straightforwardly marking more honestly. And percentages being banned.
I don't know where you’re getting your idea of "my system" from. What I've identified as the ideal is a chart-based ranking, with scores present only as a rough short-term guide. No omniscience is required, indeed the precise opposite is true. If you want to know how good a game is today, read the score. If you want to know how good it is of all time, look at the chart.
I just won't tolerate any nonsensical idea of there being an empirical value for "good" or "average" games without anything to measure them against. It's the absolute worst sort of spurious pollyanna bollocks, which leads to the scenario we've discussed in earlier features, where games are given 40% just for loading up without exploding.
Where is your "rough consensus" coming from? Please explain how it's arrived at, if it's "objective".
What I have "gleaned" from all this is that scores are pointless.
Which was a conclusion I'd already drawn. But then games journalism is largely pointless anyway. People will buy a game even if every reviewer says it's shit.
See – most EA games, Simpsons games, licensed tat, Barbie games etc.
Just thought I should mention: in America at least, many grading scales work like that: school grades (outside of college), for example, are usually meant to center on average around 70, which, naturally, is then made the point of pass or fail. Read like that, gaming reviews suddenly make a lot of sense.
Of course, what with this being a British magazine, they have no excuse.
I hate scoring games. That's partly why I did the graphs, because I'm really not into the unscience of percentages. Preconceptions, prejudices, received opinions – but never corruption, at least not in my experience – affect a score. It's an impossible conceit to suggest otherwise.
Don't think I agree with this, though:
You only need to set your goalposts once, though – once you've said "this is a 50% game", you don't need to find and review one every month just to remind people of the mediocrity benchmark. Or would you say you do? Does every issue have to make statistical sense in isolation? You could print an example game of every score in the Reviews section intro, but that wouldn't go towards fixing this score problem, would it?
The reason I found these numbers interesting is because I've never really liked percentages, but I dislike the granularity of #/10 even more. I'd rather have the nuance of a second number to say "yeah, it's a 7 – but it's a good 7". As a reader, even before I started writing for mags, 70 was like "THIS far from being a 6".
Which only goes to show, even when I see percentages, I'm still thinking in /10. Probably explains the spikes at 0s and 5s.
"You only need to set your goalposts once, though – once you've said "this is a 50% game", you don't need to find and review one every month just to remind people of the mediocrity benchmark. Or would you say you do?"
Thing is, there ISN'T a benchmark game. That's my point, that's the argument against those who say "it's okay for most games to score 70% because most games are 70% good". That policy can only work if you've got a 50% game locked in a vault somewhere, like the Bank Of England has the definitive 1kg weight under lock and key or whatever.
Which brings us to your second point, which is that basically, yes, you DO have to recalibrate every issue, because standards change, especially in the PC market. What if Zone or Gamer had chosen their benchmark game when they were launched, in 1992 or whenever? That game is now almost 20 years out of date, and to say it still represents the median or mean point of quality now is plainly bananas. But then you have to face the question of how often you update it. Yearly? Bi-yearly? Monthly? Issuely?
The percentage is a short-term measurement, the chart or ladder placing is the permanent one.
If recalibration is only required every so often, maybe (and of course, this assumes an awful lot about the intentions of the disparate magazines and reviewers) that explains the proportions. Meticulous recalibration, performed in perfect objectivity, would turn out that 50% average score.
But what do we mean by recalibrate? Review every release, and declare that games have got three percent better on average, and need to be marked down accordingly? Doing that every month would be relativistic madness.
Other ways of calibrating – a panel saying "this game is our definition of a 1, 2, 3, etc" wouldn't go into the statistics I produced here. I'm not defending scores, just saying that those charts might not support the conclusions you're making about average scores.
Out of interest, and because information is great, the month that averaged 86 was these games:
Daggerfall (65) Strife (70) Megarace 2 (79) Cyberstorm (85) Bedlam (89)
Nihilist (91) The Pandora Directive (92) Links LS (94) Syndicate Wars (95) Quake (96)
And that's with an underscored Daggerfall in there, too. It could easily have been a 90 average.
Like I said, I don't like scores, as much for my own tribal reaction to other people's scores, as my own hesitance to award them. They really do bring out the reactionary simpleton and rubbernecking cunt in me. "What, Halo 3 got 10 in Edge, fuck off, boo, why hasn't my better opinion got any attention, etc."
It's worth pointing out from the AP2 stats page that Rev Stuart Campbell's average percentage score is a slightly soft 62% (from 357 reviews, so a big enough sample), while the only ones to average out close enough to a 50% average are Rich Pelley (52.38% from 64 reviews) and, er, J Nash (53.52% from 60 reviews). Altogether now – "This probably just goes to show something, but I sure don't know what."
That's funny, as I honestly can't remember many occasions on which Stu doled out a mark in the 50s or 60s – the ones that stick in the memory for me are the savage sub-10% lows (International Rugby Challenge or Kick Off 3, obv, but the one I most vividly remember is Dennis for some reason – Stu not being able to find a single Upper, and then a huge list of "useless…" things in the Downers, which spilled over into the Bottom Line), or the really high ones (especially the unexpected ones like Yo! Joe or Naughty Ones ("I haven't given anything 90 for ages. Sod it, I'm going to give it 90")) where the enthusiasm was bouncing off the page. I wonder if that's just a wrong impression I've got, or whether the mean is being distorted by lots of lows and highs to give a seldom-awarded average?
We now return you to your scheduled programming.
Mmm, a cursory unscientific glance down the list suggests that whilst Stu is famous for his low scores, his mode decile is in the 80-89 range.
I am not even sure mode decile is the correct terminology, but I'm sure you know what I'm getting at.
Quite interesting.
RevStu
"I just won't tolerate any nonsensical idea of there being an empirical value for "good" or "average" games without anything to measure them against…Where is your "rough consensus" coming from? Please explain how it's arrived at, if it's "objective"."
Well the point of reviews is to provide buying advice. We're fundamentally trying to express the extent to which you can expect to be satisfied by your purchase, and I think this really matters: As buyers, consumers aren't comparing game A to game B, they're comparing what they get from game A to the cash they had to cough up to get it. Is it, as an isolated product, worth their money?
So we can recognise, at least, that we have a shared sense of what "good", "average" and "poor" mean in terms of consumer satisfaction. What I'm aiming for as a reviewer, then, is to classify as "good" the games that will invoke a sense of being good value for money in my readers. It's really not very helpful if, because I've only reviewed a limited selection of dreck and insist on marking relatively, games that occupy the top tier of those I've covered are actually titles that my readers find to be awful value for money (or, conversely, if they miss something brilliant because it occupies only an average position amongst those covered). That's an exaggeration, obviously, but it does illustrate the inherent disconnect in your approach to score distribution.
Proceeding along my line, the obvious question is how does a publication ensure it's in tune with its readers? In terms of *what* you classify as good/average/poor, this is partially self-fulfilling – you can assume that your core readership shares your tastes and has the keenest comprehension of the highest end of your scoring range (where the games they're most likely to have played reside) from which to derive the current state of middling products ("poor" being something of a fixed concept). In terms of *how* you classify the good/average/poor, well you yourself have chronicled how we've arrived at the general consensus of "7 = average", Stu. I'm not reading it back to you.
RevStu
"I don't know where you’re getting your idea of "my system" from. What I've identified as the ideal is a chart-based ranking, with scores present only as a rough short-term guide."
I think that's impractical for a short-staffed monthly, but even as an academic exercise I have reservations. I mean which charts higher, Resident Evil 4 or OutRun CtC? How is that meaningful? How is it useful? Are you going to do it by genre? What about cross-genre titles? What about genre-defying titles? How are you going to factor in decreasing prices? Do series feature their defining title only or are you really going to list every bleeding Medal Of Honor game?! What I mean, really, is that "do a chart" isn't an idea.
RevStu
"Once again, then, we've found that an ostensibly simple, linear, numerical system of rating games for the benefit of consumers is in fact a highly coded and abstract language requiring the reader to interpret and translate it on the basis of information which is never spoken or officially acknowledged – indeed, which is actively denied by magazines who routinely insist on their pages that "50% means average"."
Is this even true? I just plucked a copy of Play that I worked on off the shelf (issue 77 — I've been a long time dead) and here's a quote from our "guide to the grades", which sat at the start of our reviews section:
50-74
These are average games that may be entertaining but are fundamentally flawed or don't have any lasting appeal. Genre fans might like them.
Now much as I think the last line is silly (are genre fans really less discerning? I'm not convinced) and as much as I'd personally prefer that particular range to center around 50%, we at least made it crystal clear that a game of middling quality could score into the seventies and cited the characteristics that would land it there (and, obviously, there was a similar explanation for each of the other (five) ranges of scores). Is this no longer generally done? Someone pop into Smiths and check.
Not that it matters. When was the last time you flipped open a magazine and couldn't tell which games it recommended? If reviews don't matter (which they don't) it's nothing to do with this.
"I mean which charts higher, Resident Evil 4 or OutRun CtC?"
According to you, the one which best invokes a sense of being good value for money in your readers. Simple.
What does 100% mean, anyway? Does it mean that the game has no discernable flaws? Unless there's an agreed upon standard of "excellent graphics gives 5%, appropriate and engaging sound effects gives 7%, a solid and intriguing storyline gives 8%…"… well, how is everything worked out and worked out consistently from one game to the next?
How do we agree on what 100% means? It's impossible to define 'perfect' for a literary work, painting or piece of music. This is why I'd like separate scores for various elements of a game. Graphics, sound, storyline (where applicable) etc. Even then, most modern titles don't have shit graphics and sound. They're all pretty damn good, but some are just excellent (Fallout 3, though some may disagree).
We might as well just have "Shit – Meh – Worth a look – Interesting – Pretty good – Good – Excellent – Must buy" as a review conclusion scale. Is anyone really, honest-to-goodness going to decide between the purchase of 2 titles and buy the one that got 75% instead of the one that got 74%?