
Forum - View topic
The Best and Worst of Winter 2019, Mar 12-18



Note: this is the discussion thread for this article

NPC
Joined: 21 Sep 2016
Posts: 56
PostPosted: Tue Mar 26, 2019 5:56 am
zrnzle500 wrote:
^Critiquing the rankings based on 5 data points (really 4 because the most recent points will not be reflected in the rankings until next week) from 2 shows is not compelling, especially when only two of the 4 reflected in the rankings are lined up correctly, as shown below

It's absolutely fine. I don't need to write down ratings for the entire season for all shows to demonstrate the problem. That is especially true for the "instant" rating, where one data point is sufficient. I used more only to demonstrate that the cumulative rating is broken too.

Quote:
Index
February 19-25: (4.6) February 26-March 4: (4.4) March 5-11:(4.2) March 12-18: (4.0)
Boogiepop
February 19-25: (3.8) February 26-March 4: (3.8, 4.0, 4.1, 4.1, 3.4) March 5-11:(3.7) March 12-18: (4.2)

So, in the week ending March 4, Index beat all 5 Boogie episodes, not just one. In the same week Index lost 5 positions in the rankings while Boogie gained 3. Ridiculous.

Quote:
You only mention the weekly ranks of the last one (the actual last one reflected in the rankings, Boogiepop 4.2, Index 4.0),

My mistake. It is "the best and the worst" for the week of March 12-18; I used the last review before the 18th instead of going by episode number. Whatever. Look at the March 5-11 week in your data: 4.2 vs 3.7, with corresponding ranks 20 vs 15. That's 0.5 points higher in community score, 5 positions lower in the ranking.

It doesn't matter how you shuffle the data, the "ratings" don't make sense.

Quote:
If we go back one more week for a full 5 weeks, we see that Boogiepop episode 8 (4.1) also ranks below Index episode 19 (4.4), by a fair margin (#14 vs #9)

This is a case of a broken clock showing the correct time twice a day. We have to look back a few weeks to see something even somewhat plausible. This weird "wild swinging" of ratings, unrelated to users' opinions, is exactly what makes them useless.

In your data set Index is better 3 times out of 4, sometimes much better, yet it lost 8 positions in the cumulative rankings over the same exact time. This is ridiculous no matter how you look at it.

Quote:
Your claim that ranking things by people's preferences (and not just raw averages) means they are random numbers does not make sense.

The numbers are not completely random, but they might as well be; that's why I wrote "mostly random". I didn't make a baseless claim, see the examples above.

Quote:
That you can cherry pick a very small set of points (mostly incorrectly) and don't really understand how it works does not mean it is wrong.

To demonstrate that an algorithm doesn't work, it is natural to pick an obvious example. Using more data doesn't magically make the bad case disappear. I agree that going back 5 weeks on cumulative ratings was wrong because there were additional reviews, but it makes little difference to the big picture. So instead of losing 1 position Index loses 2, and Boogie still gains the same 6; it is even worse.

I understand how it "works" well enough to make an observation. The ratings clearly don't reflect readers' opinions as they are supposed to. I don't know precisely why, because I have neither the original data nor a meaningful description of the methods used. To notice that it doesn't work I don't need to know the precise reason.

Note that you "don't really understand how it works" because the method of adjusting the ratings of 27 shows based on pairwise comparisons is not completely described. That doesn't stop you from making claims about it.

You pointed out a few inconsistencies in my data selection. But your new and improved data set doesn't really change anything. The correction doesn't make your argument any stronger; it works against you.

Quote:
Your use of averages also assumes a) the people watching one show are roughly the same as those watching another and the numbers are of roughly the same size as each other and b) the averages are not susceptible to foul play both for and against any shows.

Nice straw man, you are the one making all the assumptions. Feel free to quote where I claimed anything like that.

I demonstrated that in at least one case the ratings are barely affected by average community scores and seem to be mostly random. They are based on information that only ANN has access to and can't be verified in any meaningful way. Which would be fine if they at least looked somewhat plausible. But they don't.

A rating that ignores average scores a) arbitrarily gives ridiculously high weight to the opinions of some voters while ignoring the others (a defect discussed often lately), b) is MUCH more susceptible to foul play because of (a), and c) with small data sets is prone to produce puzzling "wild swinging" ranks unrelated to reality.
zrnzle500
Joined: 04 Oct 2014
Posts: 3767
PostPosted: Tue Mar 26, 2019 6:40 am
^No, given that the rankings are based on how each of the shows is preferred versus the others, you need to compare not just two individual shows, but also the ones in between, which also affect their respective rankings.

Also, using only the most recent points in the weekly rankings and then pointing to the cumulative without going through the previous ones misses part of the bigger picture, as the cumulative is built from all the episodes in the cours.

Assuming that the average ratings are the correct way of comparing shows means that you also assume that the averages are comparable and that they are accurate (and have not been manipulated in some way). That you didn’t say the words out loud doesn’t mean you aren’t assuming those things, as both but especially the latter are necessary for the averages being a meaningful way of comparing shows, let alone a better one.

That the averages have been manipulated in the past is well documented, and in one case even admitted to on this forum (by someone trying to counter another's manipulations no less). Often, where rankings and averages diverge, it is the average which is suspect and not the rankings, which have been immunized from the sort of manipulation that the averages have been subjected to many times now (again not necessarily saying that this is the case here).
BodaciousSpacePirate
Subscriber
Joined: 17 Apr 2015
Posts: 3017
PostPosted: Tue Mar 26, 2019 9:27 am
The Schulze method is what BoardGameGeek.com uses, and that site is pretty much designed around the premise that people like to argue about numbers on the Internet. Laughing If it's good enough for them, then I figure it's good enough for us.
NPC
Joined: 21 Sep 2016
Posts: 56
PostPosted: Tue Mar 26, 2019 12:08 pm
zrnzle500 wrote:
^No, given that the rankings are based on how each of the shows is preferred versus the others, you need to compare not just two individual shows, but also the ones in between, which also affect their respective rankings.

Not necessarily. This is indeed an option, but there is nothing that makes it a necessity. In this particular case the approach leads to weird, inconsistent results, which makes the entire exercise completely pointless. MAL ratings are based on averages. Guess what: none of their ratings "swing wildly" or look manipulated in any way.

Quote:
Also, using only the most recent points in the weekly rankings and then pointing to the cumulative without going through the previous ones misses part of the bigger picture, as the cumulative is built from all the episodes in the cours.

Nope, I am missing nothing. I specifically pointed out the change of the cumulative rating over a period of time, not its absolute value. Shows in the lower part of the table have fewer voters; that's why the "let's fix the average results" approach makes them "swing wildly". If you accumulate these results over time, results may vary: it may average out the peaks and make things look somewhat plausible, or you can accumulate garbage and receive more garbage as a result. I pointed out a glaring example of the latter.

Quote:
Assuming that the average ratings are the correct way of comparing shows means that you also assume that the averages are comparable and that they are accurate (and have not been manipulated in some way). That you didn’t say the words out loud doesn’t mean you aren’t assuming those things, as both but especially the latter are necessary for the averages being a meaningful way of comparing shows, let alone a better one.

If you want to argue for the benefits of the approach you like, you can do so without arbitrarily assigning assumptions to your opponent. Again, I didn't say anything like that, and guessing my assumptions is pointless. If you guess wrong, am I supposed to argue that you guessed my assumptions wrong instead of discussing your point of view directly? Waste of time. More likely you expect me to accept your assumptions and start defending them as my own; that's commonly known as the "straw man fallacy" and should be avoided. Bad taste and all.

Quote:
That the averages have been manipulated in the past is well documented, and in one case even admitted to on this forum (by someone trying to counter another's manipulations no less). Often, where rankings and averages diverge, it is the average which is suspect and not the rankings, which have been immunized from the sort of manipulation that the averages have been subjected to many times now (again not necessarily saying that this is the case here).

See, this is the weirdest of the assumptions you assigned to me. You seem to believe that ratings based on averages are more vulnerable to manipulation than ratings which mess with them. This is so obviously false that I have no idea how you came to this idea.

In the case of averages every single voter has one vote. His ability to affect the result is very limited. Basically, to cheat he needs to vote more than once.

Now ANN sets out to fix this unbroken thing and says "if, out of all people who rated both A and B, 60% preferred A, it will be ranked higher than B". So all a malicious voter needs to do to elevate his opinion is vote for his favorite show AND for all the other shows, giving them lower (not necessarily unreasonable) ratings. Note that this technique is most powerful for less popular shows, where voters are fewer and the chances of being part of the happy 60% who decide who wins are much higher.

Here is an extreme example to demonstrate the idea. Show A is watched by 100 people, who give it an average score of 4.5. Show B is watched by just 2 people, who also happen to watch A. Both consistently give A 4 points; one consistently gives B 3.5 and the other 4.7. Any third person joining them and giving show A any score lower than B makes B rank higher than A! Magic. The 97 people who rate A but not B were just overruled by a single voter. Sounds familiar, doesn't it?
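
A minimal sketch of that count (using the hypothetical numbers above, and applying the "only people who rated both" rule as the article describes it; all names are made up):

Code:
# ANN-style pairwise count: only voters who rated BOTH shows are compared,
# everyone else is dropped. Toy data from the example above.
ratings = {f"a_only_{i}": {"A": 4.5} for i in range(97)}  # rated A only
ratings["fan_1"] = {"A": 4.0, "B": 3.5}  # rated both, prefers A
ratings["fan_2"] = {"A": 4.0, "B": 4.7}  # rated both, prefers B
ratings["third"] = {"A": 3.0, "B": 3.5}  # the third person, prefers B

prefer_a = sum(1 for r in ratings.values()
               if "A" in r and "B" in r and r["A"] > r["B"])
prefer_b = sum(1 for r in ratings.values()
               if "A" in r and "B" in r and r["B"] > r["A"])
print(prefer_a, prefer_b)  # 1 2 -> B "beats" A; 97 A-only votes are ignored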

Note that it doesn't even have to be an attempt at manipulation - the algorithm is prone to errors like that by design. That's why rankings in the lower part jump wildly without any visible correlation with users' input. Now, if we have an intentional manipulator, having multiple accounts and applying the trivial, undetectable voting strategy mentioned above makes him an exponentially stronger voter. For rankings based on averages, additional accounts increase a voter's weight only linearly.

Now, there may be some protections against the scenario I described above. But that doesn't change the fundamental problem of assigning weights to users' opinions based on their own behavior; it is just another set of kludges.
NeverConvex
Subscriber
Joined: 08 Jun 2013
Posts: 2292
PostPosted: Tue Mar 26, 2019 12:32 pm
NPC wrote:
Here is an extreme example to demonstrate the idea. Show A is watched by 100 people, who give it an average score of 4.5. Show B is watched by just 2 people, who also happen to watch A. Both consistently give A 4 points; one consistently gives B 3.5 and the other 4.7. Any third person joining them and giving show A any score lower than B makes B rank higher than A! Magic. The 97 people who rate A but not B were just overruled by a single voter. Sounds familiar, doesn't it?

Note that it doesn't even have to be an attempt at manipulation - the algorithm is prone to errors like that by design.


This may be how ANN's implementation works, but strictly speaking I don't think this is the fault of using the Schulze method. The Schulze method assumes as input a weak preference ordering is provided by each possible voter. It doesn't tell you what to do if some of your voters don't furnish that information at all, and it makes no attempt to model the reliability problems associated with one show having a lower sample of responses and so more sampling variability than another show. Those issues have to be dealt with by making choices about how to modify the base Schulze method to account for complications it didn't model.
zrnzle500
Joined: 04 Oct 2014
Posts: 3767
PostPosted: Tue Mar 26, 2019 7:24 pm
@NPC To clarify, I'm not taking a position on using averages in general, just on the averages on this site.

NPC wrote:
You seem to believe that ratings based on averages are more vulnerable to manipulation than ratings which mess with them. This is so obviously false that I have no idea how you came to this idea.

In the case of averages every single voter has one vote. His ability to affect the result is very limited. Basically, to cheat he needs to vote more than once.


Which is what has happened on this site to these averages. On this site, everyone who visits a review page can give a rating, not just registered members, and the staff here have expressed they want to keep it that way. In previous reported cases on this site, malicious actors have set up programs to effectively vote multiple times, enough to move the averages as desired. Even without that, one can vote more than once if one rates an episode on one's computer and then rate it again on one's phone or even another browser (provided you aren't logged in on both). Using these programs is only scaling this method up to be enough to affect the averages, which again has happened multiple times in the past on this site.

Sites like MAL are not vulnerable to this kind of attack, as only registered users may rate a show, which would mean they would only need to prevent these programs from being able to log in. The rankings in this column are not affected by these attacks, as those fake ratings are not included in the calculations (they do not publicize the method that they use to do this so that the people behind the attacks don't know how to get around them). So long as the averages on this site are vulnerable to spamming fake votes as above, the averages by themselves can't be taken at face value when they significantly diverge from the rankings.

Like I have said before, you have been assuming that the averages on this site are always accurate and haven't been manipulated (at least noticeably). This has been proven wrong in a number of previous cases, as anyone who has followed these rankings from the beginning can vouch.

When you present the averages and claim they are more accurate than the rankings based on their method because the ranks don't match with the order of the averages in some cases, you are necessarily assuming that a) comparing the raw averages can accurately tell you which shows are preferred and b) the averages themselves are accurate and have not been (noticeably) manipulated. That is not a guess, but a function of the argument you are making. If the averages couldn't accurately tell you which show was preferred over the other, saying the ranking is not accurate because it doesn't match the order based on the averages in some cases would be meaningless. Likewise, if the averages had been inaccurate because they were manipulated, you could not rely on those averages to tell you which show was preferred, as those numbers wouldn't be the real number or, as you would say, basically just be random numbers. Both need to be true to make the argument that you had made, so it is not a strawman to point out that you are making those assumptions needed to make your argument even before you say it explicitly.

You can reasonably argue that assuming the raw averages are comparable to each other and able to show you which shows are preferred is correct, and for sites like MAL, you can probably assume that the averages are not manipulated to any noticeable degree, but for this site's averages, assuming the averages as we see them on this site have not been (noticeably) manipulated has proven false a number of times, as the staff and anyone following this column from the beginning can tell you.

Side note:

NPC wrote:
Shows in the lower part of the table have fewer voters;


This isn't necessarily true, as there are some shows that many people are watching but are also usually rated lower than most of the other shows (Black Clover and Boruto come to mind, and Shippuden before that, which had been used by some users here as the line between the not so good and the truly bad).
NPC
Joined: 21 Sep 2016
Posts: 56
PostPosted: Wed Mar 27, 2019 12:30 am
NeverConvex wrote:
This may be how ANN's implementation works, but strictly speaking I don't think this is the fault of using the Schulze method.

Indeed. I never disputed the validity or applicability of the Schulze method in general, for the simple reason that all I knew about it was ANN's one-line summary. But statistics is complex stuff, and it is easy to use even a perfect instrument incorrectly. For me, shows wildly jumping up and down a quarter of the range without visible reason is a telling sign of statistics being used incorrectly, be it the wrong method, an invalid implementation, or bad data.

I assumed that the method was used to compare pairs and was likely to introduce problems because of a lack of transitivity. The Wikipedia article is rather enlightening; the method is indeed for group ranking. There is a short list of failed voting system criteria which indicates a few ways to play the system. The method is not intended to protect against them.

zrnzle500 wrote:
To clarify, I'm not taking a position on using averages in general, just on the averages on this site.

Happy coincidence: we are talking about this site and its weird jumping rankings. Not that it changes anything.

Quote:
>Basically, to cheat he needs to vote more than once.
Which is what has happened on this site to these averages.

"You seem to be under some bad misconceptions here". Schulze methods doesn't provide any defense from repeating voting. In fact, as I mentioned, there is very simple and legal method of tactical voting which amplifies voter's power much more than in case of averages. Malicious voter's work is made easier, not harder. Schulze method intended to order voter's choices which can't be easily ordered otherwise, it is not a magical incantation which solves every problem imaginable.

Quote:
The rankings in this column are not affected by these attacks, as those fake ratings are not included in the calculations (they do not publicize the method that they use to do this so that the people behind the attacks don't know how to get around them).

But in that case malicious voters are defeated before they affect the results, repeat voting is not a problem at all thanks to the secret ANN magic which banishes them, and your fear of evil masterminds distorting the averages is completely irrational!

Quote:
So long as the averages on this site are vulnerable to spamming fake votes as above, the averages by themselves can't be taken at face value when they significantly diverge from the rankings.

One more time - assigning voters weight based on their behavior makes spam manipulation easier, not harder. You need fewer fake voters/cartel members, because with the correct voting tactic a bad guy can significantly amplify the weight of his vote over "usual" voters who don't use the same tactic. The problem we are discussing now is not wildly jumping averages; it is wildly jumping artificial rankings unrelated to reality. It is amusing that after all the examples I have given, you are still ready to take these rankings at face value. It is not amusing enough to continue explaining how irrational that is, though.

Quote:
Like I have said before, you have been assuming that the averages on this site are always accurate and haven't been manipulated (at least noticeably).

And like I said before, you should stop putting words in my mouth. I didn't say or imply anything like that, and I will not waste time discussing my imaginary assumptions.

Quote:
Likewise, if the averages had been inaccurate because they were manipulated, you could not rely on those averages to tell you which show was preferred,

See above; you have a really bad misconception that applying the Schulze method to manipulated data magically produces good results. The Schulze method doesn't work like that! Goodnight. If you feed garbage into Schulze, you get garbage out; it is that simple. Compromised averages mean compromised input to Schulze (an ordering method) and, inevitably, a compromised result.

Sorry dude, I am done. It's like talking with a flat Earther. Arguments are ignored; there is irrational trust in magical ANN filtering of spammers which doesn't protect the averages but somehow protects an ordering algorithm based on the same exact data (wow!); there is insistence on assumptions I never made. This is exhausting. Have a nice day.
NPC
Joined: 21 Sep 2016
Posts: 56
PostPosted: Wed Mar 27, 2019 9:39 pm
NeverConvex wrote:
This may be how ANN's implementation works

Turns out yes, it is. I had decided to forget about these strange rankings altogether, but one thing kept bugging me. The Schulze method, by all indications, is a reasonably good method of voting with multiple preferences. It can't produce completely absurd results like the ones in my example. When used correctly.

According to the description ANN has given, they are using it wrong:
ANN article being discussed wrote:
The rankings are computed using the Schulze method, with the variation that unrated titles are considered as abstentions instead of lower than the rated titles. This roughly means that if, out of all people who rated both A and B, 60% preferred A, it will be ranked higher than B.

The thing is, there is NO such "variation". "Variation" here means "we don't use the Schulze method in any shape or form; for no reason whatsoever we concocted our own method with unknown, yet obviously absurd, characteristics".

Here is another example: 90 people rated show A, 100 other people rated show B, and there is only one person who rated both. As per the ANN interpretation, this one person alone decides which show is better. This is, obviously, absurd.

Now, to the real Schulze method:
Wikipedia article about Schulze method wrote:
The input for the Schulze method is the same as for other ranked single-winner electoral systems: each voter must furnish an ordered preference list on candidates where ties are allowed.
...
Each voter may optionally:
...
keep candidates unranked. When a voter doesn't rank all candidates, then this is interpreted as if this voter (1) strictly prefers all ranked to all unranked candidates, and (2) is indifferent among all unranked candidates.

So every single voter votes on every single candidate. If a voter didn't explicitly rank a candidate, he implicitly ranked that candidate below all those he did rank. No one ever "abstains"; it is impossible.

In ANN's case, if the method were used correctly, a voter who rated only one show would simultaneously rate all the other shows below it for the purposes of comparison. E.g. if he gave a single show a 4 and didn't vote on anything else, he simultaneously gave 0 to all the others. There is no "out of all people who rated both A and B"; they ALL rated both A and B, explicitly or implicitly.

When this method is applied to the example above, there are 3 cases: the voter who rated both shows ranked A higher than, lower than, or the same as B. Correspondingly, the numbers used by the Schulze method for voters who prefer A or B would be (91 and 100), (90 and 101) and (90 and 100); not a single vote is ever ignored.
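
A quick check of the first case (hypothetical voters; unrated treated as 0, i.e. below everything the voter ranked):

Code:
# Standard Schulze input for the 90/100/1 example above: a voter who
# didn't rate a show implicitly ranks it below everything he did rate.
votes = ([{"A": 4.0} for _ in range(90)]     # rated A only: implicitly B = 0
         + [{"B": 4.0} for _ in range(100)]  # rated B only: implicitly A = 0
         + [{"A": 4.0, "B": 3.0}])           # the one overlap voter, prefers A

a_over_b = sum(1 for v in votes if v.get("A", 0) > v.get("B", 0))
b_over_a = sum(1 for v in votes if v.get("B", 0) > v.get("A", 0))
print(a_over_b, b_over_a)  # 91 100 -- the first case; no vote is ignored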

Every single "The Best and Worst ..." article ANN published for years uses wrong math. This is the reason why the whole article is a big disclaimer "nothing to look at here, it's not our fault that it doesn't make sense" and why it couldn't possibly be used for any kind of reasonable ranking analysis.

This is all just one huge joke.

If the method were used correctly there wouldn't be any "wildly swinging" rankings, and there wouldn't be any show that beats another by 0.5 points of user score yet ranks 5 positions below it. In short, it would all make sense.
killjoy_the
Joined: 30 May 2015
Posts: 2459
PostPosted: Thu Mar 28, 2019 3:41 am
^ If simply rating one show meant that you think it's better than all others you didn't vote for, this'd make popularity into a huge deal for the results (popular series gets more votes). So you could totally have something with a 3.0 rating that was rated by 1000 people beat something rated 4.5 that was rated by 100 people.
NeverConvex
Subscriber
Joined: 08 Jun 2013
Posts: 2292
PostPosted: Thu Mar 28, 2019 7:25 am
killjoy_the wrote:
^ If simply rating one show meant that you think it's better than all others you didn't vote for, this'd make popularity into a huge deal for the results (popular series gets more votes). So you could totally have something with a 3.0 rating that was rated by 1000 people beat something rated 4.5 that was rated by 100 people.


Yeah, I think this is true. If abstentions are identified with a "lower than lowest" possible rating, like 0 on a 1-5 scale, and if the 1000 show A & 100 show B people in this hypothetical are non-overlapping, then the 1000 people would be assumed to prefer show A (which they ranked) to show B (which they didn't), and likewise the 100 people would be assumed to prefer show B to show A, so the strength of the strongest path from A to B would be at least 1000, and from B to A at least 100. And all other paths would be worse than these, because their strength can never be larger than the number of people who voted on A or B. So show A with the 1000 people would win.

In the other extreme, we could suppose the 100 people rated both shows, and that they all rated B better than A. Then the remaining 900 people would have rated (ranked) A better than (unranked) B, while the 100 people would have done the opposite. So the strength of A-B would be at least 900, and the strength of B-A at least 100. Again strengths can never be larger than the number of people who voted on a show, so A will win.

That said, I do not think this is what the ANN description NPC quoted meant. I parse "with the variation that unrated titles are considered as abstentions instead of lower than the rated titles" as meaning that, in ANN's variation on the Schulze method, if person # i voted on A but not on B, then in counting how many people preferred A to B, person # i would simply be dropped from the total.

If that's the case, then if the 1000 and 100 people are non-overlapping then the 1000 people do not contribute to the strength of the direct A-B path, and the 100 people do not contribute to the strength of the direct B-A path. It gets more complicated now, so, for simplicity, assume the 100 people all voted exactly 4.5 on B and the 1000 people exactly 3.0 on A. We know the direct paths in both directions have strength 0.

But there could be a third show, show C, and the 100 show B people might have all ranked show C as a 3, and then there could be a different group of 100 people (since the 100 people never voted on A) who voted C as a 5 and show A as a 4. This B-A path would then have strength 100, so strength(B,A) >= 100.

Now maybe the 1000 show A people all thought show C was awesome, and they all voted show C as a 5 -- higher than their 3.0's on show A, so there could be no non-zero strength A-B path through C. And maybe this is all the shows and all the votes. Then the A-B strength would be 0, and the B-A strength 100, so show B would win.
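
To make that concrete, a small sketch of those pairwise counts (my toy numbers above; ANN-style counting, where a voter is dropped from any pair he didn't rate both sides of):

Code:
# Pairwise counts under the "rated both" rule, using the toy groups above.
groups = ([{"A": 3.0, "C": 5.0}] * 1000   # the 1000 show A people (love C)
          + [{"B": 4.5, "C": 3.0}] * 100  # the 100 show B people
          + [{"C": 5.0, "A": 4.0}] * 100) # a different 100 who rated C and A
shows = ["A", "B", "C"]

d = {x: {y: 0 for y in shows} for x in shows}  # d[x][y]: voters with x > y
for v in groups:
    for x in shows:
        for y in shows:
            if x != y and x in v and y in v and v[x] > v[y]:
                d[x][y] += 1

# Direct A-B margins are 0 (nobody rated both), but the indirect path
# B -> C -> A has strength min(d[B][C], d[C][A]) = min(100, 1100) = 100.
print(d["A"]["B"], d["B"]["A"])  # 0 0
print(d["B"]["C"], d["C"]["A"])  # 100 1100 -> B beats A via C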

NPC wrote:
Here is another example: 90 people rated show A, 100 other people rated show B, and there is only one person who rated both. As per the ANN interpretation, this one person alone decides which show is better. This is, obviously, absurd.


I don't think this is quite true. 90 people rated A; 100 people rated B. There's only one person in common between them. But if the A-people and the B-people all voted on a third show, C, and some (possibly completely different) group of people voted on whether C is better than A or B, then you could still get non-zero, strength-greater-than-1 paths from A to B or from B to A, so more than just the 1 person-in-common could matter.

In general it seems quite hard to build intuition for how Schulze works without actually drawing some graphs, and coding it up to apply to & visualize some small-scale examples.
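
For instance, a minimal self-contained sketch (toy ballots on a 0-5 scale, unrated treated as 0 per the standard convention; the path step is a Floyd-Warshall-style widest-path pass):

Code:
# Minimal Schulze: pairwise preference counts, then strongest paths.
ballots = [
    {"A": 5, "C": 3},  # unrated shows count as 0
    {"B": 4, "C": 2},
    {"A": 2, "B": 5},
    {"C": 5},
]
shows = ["A", "B", "C"]

# d[x][y] = number of voters strictly preferring x to y
d = {x: {y: 0 for y in shows} for x in shows}
for b in ballots:
    for x in shows:
        for y in shows:
            if x != y and b.get(x, 0) > b.get(y, 0):
                d[x][y] += 1

# p[x][y] = strength of the strongest path from x to y
p = {x: {y: d[x][y] if d[x][y] > d[y][x] else 0 for y in shows} for x in shows}
for k in shows:
    for i in shows:
        for j in shows:
            if i != j and k not in (i, j):
                p[i][j] = max(p[i][j], min(p[i][k], p[k][j]))

# x ranks above y when p[x][y] > p[y][x]
wins = {x: sum(p[x][y] > p[y][x] for y in shows if y != x) for x in shows}
print(sorted(shows, key=wins.get, reverse=True))  # ['B', 'A', 'C'] here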
NPC
Joined: 21 Sep 2016
Posts: 56
PostPosted: Thu Mar 28, 2019 12:59 pm
killjoy_the wrote:
^ If simply rating one show meant that you think it's better than all others you didn't vote for, this'd make popularity into a huge deal for the results (popular series gets more votes). So you could totally have something with a 3.0 rating that was rated by 1000 people beat something rated 4.5 that was rated by 100 people.

You are indeed correct; this is how the Schulze method works. The more popular show wins, as it should. If 1000 people cared to rate one show and only 100 the other, it is obvious which one is more popular regardless of averages.

In the particular case of Index vs Boogie (+0.5 score difference, -5 positions), though, I am fairly certain that Index would win. ANN's statistics are hidden, but according to MAL they have about the same number of active viewers (slightly more for Boogie). I would expect a significant intersection between the viewers of the two to exist, and this intersection would likely tip the balance, considering the large difference in averages (this is where they play their role). I think at the very least there wouldn't be a 5-position spread between them. Use a method which makes sense - get results which make sense and are actually interesting to look at.

And yes, the whole idea of the Schulze method is to be used in a popularity contest. If used properly, it creates an ordered list of candidates, from most to least popular. Scores are used for comparison only; they are not part of the result. Now, what exactly the ANN "variant" produces is completely unknown - this is not the Schulze method, this is a method without a name, invented and used exclusively by ANN. It produces something weird.

The trivial way to deal with popularity and score is to have two graphs, "popularity" and "average score". The current ANN graphs are "something weird" and "something weird, accumulated". The only reason for the second graph's existence is to try to even out the mad dance of insane rankings in the first one, caused by the crippled rating method. A band-aid over an axe wound.

NeverConvex wrote:
Yeah, I think this is true. If abstentions are identified with a "lower than lowest" possible rating, like 0 on a 1-5 scale, and if the 1000 show A & 100 show B people in this hypothetical are non-overlapping, then the 1000 people would be assumed to prefer show A (which they ranked) to show B (which they didn't), and likewise the 100 people would be assumed to prefer show B to show A, so the strength of the strongest path from A to B would be at least 1000, and from B to A at least 100. And all other paths would be worse than these, because their strength can never be larger than the number of people who voted on A or B. So show A with the 1000 people would win.

And I think no sane person would argue with this result.

Quote:
In the other extreme, we could suppose the 100 people rated both shows, and that they all rated B better than A. Then the remaining 900 people would have rated (ranked) A better than (unranked) B, while the 100 people would have done the opposite. So the strength of A-B would be at least 900, and the strength of B-A at least 100. Again strengths can never be larger than the number of people who voted on a show, so A will win.

Which is, again, completely justified. When you have 1000 people voting for one show and 100 for the other, there is very little doubt which one is more popular. Sure, among the 1000 there could be 800 haters persistently giving the show a 1 to sink it, but they are only hurting themselves, and they are malicious voters anyway. ANN would just casually throw 900 perfectly good votes into the garbage can. No wonder the results defy logic.

Quote:
That said, I do not think this is what the ANN description NPC quoted meant. I parse "with the variation that unrated titles are considered as abstentions instead of lower than the rated titles" as meaning that, in ANN's variation on the Schulze method, if person # i voted on A but not on B, then in counting how many people preferred A to B, person # i would simply be dropped from the total.

Yep. And the moment it does that it stops being Schulze and becomes something completely different.

Using the same almighty voters as in the 90 vs 100 example, it is easy to build an example with 3 shows, A, B and C, where A is better than B, B is better than C, and C is better than A. The transitivity of Schulze is lost. Using the ANN method to build a sane ranking is impossible.

Quote:
I don't think this is quite true. 90 people rated A; 100 people rated B. There's only one person in common between them. But if the A-people and the B-people all voted on a third show, C, ...

Naah, see, at this point you are altering the intentionally simple example, and as soon as you do that you can't claim the original conclusion was wrong. You moved the goalposts. Pairwise comparison as in this example is the cornerstone of Schulze; if it is broken there is nothing left.

Quote:
In general it seems quite hard to build intuition for how Schulze works without actually drawing some graphs, and coding it up to apply to & visualize some small-scale examples.

It seems rather simple to me. People who didn't care to rate a show gave it a 0, which makes it possible to compare every voter's opinion about any two shows. So you can always easily calculate how many people prefer one show to another. The rest is just finding the order most consistent with these numbers.

Now, understanding how the ANN method "works" is a real headscratcher. I think it just doesn't work, because there seems to be not a shred of sanity in it.
NeverConvex
Subscriber
Joined: 08 Jun 2013
Posts: 2292
PostPosted: Thu Mar 28, 2019 1:54 pm
NPC wrote:
Naah, see, at this point you are altering the intentionally simple example, and as soon as you do that you can't claim the original conclusion was wrong. You moved the goalposts. Pairwise comparison as in this example is the cornerstone of Schulze; if it is broken there is nothing left.


The Schulze method is built on pairwise comparisons, yes, but it is not built on direct pairwise comparisons. It attempts to compare show A to show B not simply by directly asking how many people preferred A to B, but also asking how many people (pairwise) preferred A to C to B, and pairwise preferred A to C to D to E to B, and A to D to B, and A to E to D to B, etc (note that the pairwise comparisons from A to C and C to B and D to E do not in general involve the same people!), and then it maximizes over these paths (and also minimizes within each path; there are two quantifiers, and they both matter, and that is what makes it complicated).

I don't understand what you mean when you say I altered your example. In your example, you told us who rated show A and who rated show B and who rated both. But you didn't tell us whether other shows also exist, and - if so - who voted for them and how. My extended version of your example was my way of pointing out that that missing information matters, because the Schulze method cares about more than just the people who voted for A and B.

NPC wrote:
It seems rather simple to me. People who didn't care to rate a show gave it a 0, which makes it possible to compare every voter's opinion about any two shows. So you can always easily calculate how many people prefer one show to another. The rest is just finding the order most consistent with these numbers.

Now, understanding how the ANN method "works" is a real headscratcher. I think it just doesn't work, because there seems to be not a shred of sanity in it.


I don't have a well-formed opinion on ANN's variation on Schulze at this point, as I think it is quite complicated. I agree that the version of Schulze where everyone's preferences are always included but set to 0 if they didn't vote is much simpler (although not as simple as you suggest here). I am not sure if it is better.

EDIT:

I do agree that a multi-dimensional measure, as you suggest here, would be cleaner:

NPC wrote:
The trivial way to deal with popularity and score is to have two graphs, "popularity" and "average score".


However, that would complicate the visualizations ANN likes to use, because those are already 2-dimensional (popularity plotted against week/episode number or some such). If they wanted to graph all 3, they'd have to move to a plot with a third independent dimension, and be prepared to explain and interpret it. That's not a trivial thing to do (but also not impossible; 2 is easy, 3 is not so bad, 4 is hard, 5 is a nightmare, at 6+ you're just screaming into people's eyes).
zrnzle500
Joined: 04 Oct 2014
Posts: 3767
PostPosted: Thu Mar 28, 2019 5:30 pm
NeverConvex wrote:
However, that would complicate the visualizations ANN likes to use, because those are already 2-dimensional (popularity plotted against week/episode number or some such). If they wanted to graph all 3, they'd have to move to a plot with a third independent dimension, and be prepared to explain and interpret it. That's not a trivial thing to do (but also not impossible; 2 is easy, 3 is not so bad, 4 is hard, 5 is a nightmare, at 6+ you're just screaming into people's eyes).


I think varying the size of the marker by how many people are following the show (on this site at least) would be fairly intuitive, but with how busy the chart is already, it wouldn't be very readable (especially on mobile) and you wouldn't have very much room to work with before stuff starts overlapping. They could have two separate charts, one based on ranking and the other based on number of viewers, and users could just toggle between them, but I don't know how plausible that would be in their setup. The simplest way would be to add another column with the rank by number of viewers (or popularity or whatever language they want to use), and stick it after the anime title, so people don't confuse the two ranks. I don't know that the number of people following each show on this site changes all that much relative to the others, so you probably don't need to show the previous weeks. Though, even that might be difficult to fit into the space they have on the page, given how long some of the titles are.
NPC
Joined: 21 Sep 2016
Posts: 56
PostPosted: Thu Mar 28, 2019 6:23 pm
NeverConvex wrote:
I don't understand what you mean when you say I altered your example. In your example, you told us who rated show A and who rated show B and who rated both. But you didn't tell us whether other shows also exist, and - if so - who voted for them and how.

It is very simple - the goal of the example is to present the simplest situation in which the problem occurs. Therefore, for the sake of this example, other shows don't exist and any considerations related to them are irrelevant. An infinite number of modifications to the example is possible; considering them all at the same time is both impossible and pointless. This is an example where just two shows are compared, as per the ANN "variation" description. Adding something produces a new, completely different example, and conclusions drawn for the original one are not expected to remain valid.

I thought it was obvious; I should have said so explicitly. It seems you thought I was making a general statement about all possible cases, sorry about that.

Quote:
I don't have a well-formed opinion on ANN's variation on Schulze at this point, as I think it is quite complicated. I agree that the version of Schulze where everyone's preferences are always included but set to 0 if they didn't vote is much simpler (although not as simple as you suggest here).

Again, nothing complex here. "All voters vote on every candidate" is the only situation in which the Schulze method is applicable; it is the case for which the method was designed and the case where its behavior is known. None of the considerations in Wikipedia relate in any way to the "variation" made by ANN, because such a variation doesn't exist as far as Wikipedia is concerned.

Wikipedia does mention one other variant of the Schulze method: "for proportional representation elections, a single transferable vote variant has been proposed." No, this is not the variant used by ANN. Note that this variant has its own name, "Schulze STV", and a long, detailed article is devoted to it. Modifying any part of a vote evaluation method creates a new method with new, different behavior, which has to be carefully evaluated for sanity of results. ANN did nothing of the sort.

If ANN defines its own method (even if it has some elements used elsewhere), it can't honestly claim to be using the Schulze method. It's like saying "we are baking an apple pie, the variation where cement is used instead of flour". See, the apple pie recipe doesn't allow such a "variation". No matter how strictly they follow the rest of the recipe, whatever ANN is baking, it is certainly not an apple pie and shouldn't be treated as such under any circumstances. And the method they use here is certainly not Schulze.

Quote:
I am not sure if it is better.

Wait, you are not sure that using a well-known, reviewed method, tried and tested for decades and guaranteed to work, is better than something no one else has ever used, with obvious examples of absurd behavior? Hmm. I just don't know what to say to that, really. I guess, "why?"

Quote:
I do agree that a multi-dimensional measure, as you suggest here, would be cleaner:

I did not, really. It is two separate two-dimensional graphs showing change over time of a) the popularity ranking as determined by the Schulze method, and b) the average user scores as shown in episode reviews. Nothing changes regarding the timeline; it is just that the data for the graphs would now actually make sense and be interesting. It couldn't be simpler.

Quote:
However, that would complicate the visualizations ANN likes to use, because those are already 2-dimensional (popularity plotted against week/episode number or some such). If they wanted to graph all 3, they'd have to move to a plot with a third independent dimension, and be prepared to explain and interpret it. That's not a trivial thing to do (but also not impossible; 2 is easy, 3 is not so bad, 4 is hard, 5 is a nightmare, at 6+ you're just screaming into people's eyes).

It wouldn't complicate anything. The ranking graph can use exactly the same method as now, change of ranking over time, just showing sane data this time. The second graph is a trivial plot of the average score per episode, in the 0-5 range, over time. I have no idea what third value and third dimension you are talking about, or where all the complexity comes from. I certainly didn't suggest anything like that.
Dan42
Chief Encyclopedist
Joined: 02 Jan 2002
Posts: 3782
Location: Montreal
PostPosted: Sat Mar 30, 2019 12:58 am
Ooohhhh, someone dares to criticize my perfect model and I'm late to the party? Aw, shucks.

@NPC, just because you don't understand something doesn't mean it's insane. But, meh, I'll give you a C+ for trying, at least.

@zrnzle500, I raise my hat to you for your patience and civility in the face of shocked incomprehension.

@NeverConvex, bravo, superb math. Yeah, you really nailed how it works.

The default version of Schulze, where abstentions are marked as lowest, is, from what I understand, meant to operate in a political context. In other words, when you go to the polls you're supposed to have informed yourself about the political parties, and leaving a party unranked means it is considered "not serious" and should not win. Most voters will input a preference for every major party.

In contrast, unranked anime is more likely to mean "I don't have time to watch all shows this season" or "I haven't caught up to the most recent episode of this series". Most voters will NOT input a preference for every major series (even if anyone could agree on what the major series are). The vote is much more fragmented. If I were to apply the default version we'd wind up with a straight popularity ranking. That seems to be what NPC wants, but not me; I'd find that supremely useless. Schulze is NOT supposed to be a mere popularity ranking; otherwise we'd just need to sort by number of votes, no need for any fancy calculations. Schulze is a preference ranking.

Now, Mr. NPC, you seem to have just discovered the Schulze method and gone to the Almighty Wikipedia and absorbed what's in there as The Ultimate Truth Handed From God. But that's not how it works. Without proper judgment, dogmatic application of "the recipe" is just as likely to be a recipe for failure. The variation I introduced is not something unholy and unthinkable like you seem to think, but something born out of understanding the Schulze method and the characteristics of the data being analyzed. That's how math and science in general works. There's no dogma. In fact I'd be very interested in knowing what Mr. Schulze thinks of my little variation. It does increase intransitivity tremendously, but I think that's better than having the outright falsehood of "I think all the series I voted for are superior to everything else". Because at that point falsehood in = falsehood out.

NPC, you need to study up a bit more. You talk about "the transitivity of Schulze is lost", and yet Schulze does not and has never guaranteed transitivity! See Arrow's theorem.

There's also the rather big issue that there is no possible logical relationship between "beating" another series and having one's rank go up or down. Think about it. If Boogiepop beats W'z, does that mean it should go up in ranking? In that case it should go up in ranking every week, because every week it's better than W'z. In fact every series is better than W'z, so they should all go up every week?

The relationship between the weekly and cumulative graphs is this: if the weekly position is higher than the cumulative position (ex: weekly #4 < cumulative #14), the cumulative will tend to go up, regardless of whether the weekly ranking went up or down from the previous week. If you have trouble wrapping your mind around that, pause and think a bit more.

And maybe it's worth pointing out one more time something you can't seem to comprehend no matter how many times zrnzle500 explains it: the community score includes spam votes. We use some heuristics to remove spam votes before running the Schulze algorithm. So yes, the Schulze method doesn't magically remove spam. It's the filtering prior to the Schulze method.

On top of that, you're mistaken about averages being less susceptible to foul play. If a show has few votes it's very easy for a spammer to push it up or down. But with pairwise methods (such as Schulze), even if you successfully manage to spam A>B into becoming B>A, that pairwise comparison is "interlocked" with the other 25 shows. So it's likely that in the overall picture it will be ignored like statistical noise by the Schulze method itself.

You also seem to be upset about shows "wildly jumping" up and down, but how is that supposed to be a problem? It should be obvious that comparing an entirely different set of episodes week to week should yield entirely different results. If series A is better than B one week, there's no guarantee whatsoever that the same will hold true next week. And yet if you look at any given series, even with all the jumping around the top series tend to stay in the top and the bottom series tend to stay in the bottom. The graph really does show that series tend to stick around a certain "level" despite the normal weekly volatility.

Come on, make some effort! As it is, your ramblings are so nonsensical they don't have the slightest bite!