In southeastern Seoul, there is a sashimi restaurant with a 4.2 Kakao rating. On the surface, that does not look bad. But 64% of this restaurant’s reviewers were non-discerning raters: people who give 4.5 stars or higher to nearly every place they visit. Once weighting is applied, the Score drops to 3.38.
In the same area, there is a Western restaurant, also with a 4.2 Kakao rating. This one has 64 Gold reviewers, and the share of non-discerning reviews is 23%. Its weighted Score is 4.2, matching the Kakao rating exactly.
On Kakao Map, both restaurants show the same 4.2 rating; weighted analysis separates them by a 0.8-point gap. Where #1 established that the accuracy of star ratings depends on the quality of the reviewers, this article explains how that quality is quantified: the design logic of the weighting system, the methods for reducing noise, and validation on the ground.
Credibility Is Determined by Two Axes
Every reviewer’s credibility is calculated as Volume × Discrimination. The scale and reference points of each factor were set deliberately, after analyzing the distribution of the review data.
Volume uses a logarithmic scale, with the upper cap set at 100 reviews. A reviewer with 10 reviews scores 0.52, 50 reviews scores 0.85, and 100 reviews reaches 1.0, with no further increase beyond that. Whether someone has written 300 reviews or 1,000, their Volume contribution is the same. The idea is that while more reviews generally increase credibility, beyond a certain point “wrote more” does not automatically mean “more trustworthy.”
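The exact curve is not published here, but a log1p form normalized at 100 reviews reproduces the anchor values above, so a sketch in Python might look like this (the function name and constants are illustrative assumptions):

```python
import math

def volume_weight(n_reviews: int) -> float:
    """Log-scaled review-count weight, capped at 100 reviews."""
    # Assumed form: log1p(n) / log1p(100) reproduces the anchors in the text:
    # 10 reviews -> ~0.52, 50 -> ~0.85, 100 or more -> 1.0.
    return min(1.0, math.log1p(n_reviews) / math.log1p(100))
```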
Discrimination is an estimate of a reviewer’s ability to distinguish between truly good places and less impressive ones. Ideally, we would measure each reviewer’s score distribution directly using standard deviation, but the Kakao Map API provides only a reviewer’s average star rating and number of reviews. We cannot access each reviewer’s full rating distribution. So we decided to use average star rating as a proxy for discrimination. The curve is centered at 3.2 and designed to converge toward 0 at both extremes.
Why 3.2? On a five-point scale, maintaining an average of 3.2 means giving 4–5 stars to good places and 2–3 stars to ordinary ones. Ratings only become informative when there is variation in them.
A reviewer who gives 5 stars to every place adds no information with yet another 5-star rating; it says nothing about whether this particular place is truly special.
Of course, average rating is not a perfect indicator. Some reviewers with an average of 4.2 may in fact use the full 1–5 range. But when we cannot inspect each reviewer’s full score distribution, average rating is the most practical proxy for estimating discrimination. The center point of 3.2 was chosen based on the observation that reviewers in the 3.0–3.5 range showed the greatest variation in scores across venues.
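As with Volume, the exact functional form is not spelled out; a Gaussian centered at 3.2 matches the description (peaking at the center, converging toward 0 at both extremes) and, with a width of about 1.0, reproduces the worked weights later in this article to within rounding. A hedged sketch:

```python
import math

def discrimination_weight(avg_rating: float,
                          center: float = 3.2,
                          sigma: float = 1.0) -> float:
    """Bell-shaped proxy for discrimination, peaking at an average of 3.2."""
    # Assumed Gaussian form: ~1.0 near the 3.2 center, ~0.5 around
    # 2.0 and 4.4, converging toward 0 at both ends of the 1-5 scale.
    return math.exp(-((avg_rating - center) ** 2) / (2 * sigma ** 2))
```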
On this curve, the range where Discrimination is 0.5 or higher corresponds roughly to an average rating of 2.0–4.4. Gold reviewers are those within this range who also have at least 50 reviews and an average rating between 2.5 and 4.2. Their Volume is at least 0.85 and their Discrimination at least 0.65, placing them in the upper tier on both axes. This threshold was set at the intersection of “the range that naturally receives high weights in the formula” and “the range whose judgment quality was confirmed through manual review.” Of the reviewers behind the 1.75 million collected reviews, 12,911 meet these conditions.
A base value (eps) of 0.05 is added to the final weight. The design principle is that even reviewers with very little history or highly extreme tendencies are not ignored completely. However, their influence remains at about one-fifteenth that of a Gold reviewer.
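Building on the two helpers sketched above, and assuming eps is a simple additive floor (which is consistent with the worked numbers in the next section), the final weight and the Gold check come together like this:

```python
EPS = 0.05  # base weight: no reviewer is ignored completely

def reviewer_weight(n_reviews: int, avg_rating: float) -> float:
    """Final credibility weight: Volume x Discrimination plus the eps floor."""
    return volume_weight(n_reviews) * discrimination_weight(avg_rating) + EPS

def is_gold(n_reviews: int, avg_rating: float) -> bool:
    """Gold reviewer: at least 50 reviews, average rating between 2.5 and 4.2."""
    return n_reviews >= 50 and 2.5 <= avg_rating <= 4.2
```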
The Same 5 Stars, a 15-Fold Difference in Weight
Let us see what difference this formula makes in practice. Two reviewers both gave 5 stars to the same restaurant. One has written just 1 review and has an average of 5.0. The other has visited 200 places and maintained an average of 3.3.
Applying Volume × Discrimination, the first reviewer’s weight is 0.07, while the second reviewer’s is 1.04. A 15-fold difference.
A reviewer with just 1 review and an average of 5.0 has two weaknesses at once. Volume is extremely low (too little review history), and Discrimination is also extremely low (the average is too extreme). It is a bottom-tier case on both axes.
The more interesting case is the reviewer with 50 reviews and an average of 4.8. The review count is not small, yet the weight is only 0.29. Volume is high at 0.85, but low Discrimination drags down the final weight. If someone visits 50 places and still gives mostly 4s and 5s, it is hard to get a useful answer from them to the question, “Is this place actually special?”
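Feeding these three reviewer profiles through the sketch above lands within a couple of hundredths of the quoted figures; the small drift comes from the assumed constants, which are not published:

```python
for n, avg in [(1, 5.0), (200, 3.3), (50, 4.8)]:
    print(f"{n:>3} reviews, avg {avg}: weight {reviewer_weight(n, avg):.2f}")
# Prints roughly 0.08, 1.05 and 0.29 with the assumed constants,
# against 0.07, 1.04 and 0.29 in the text.
```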
Filtering Out Noise — Non-Discerning Reviewers
Assigning weights alone is not enough; the review data carries its own inherent source of noise.
Non-discerning reviewers are raters who show a pattern of giving high scores to nearly every place they visit. They already receive low weights on the Discrimination curve in Section 1, but their share at a venue is itself an indicator that lowers the reliability of that venue’s data.
The detection criteria were designed in three stages based on review count.
| Stage | Average rating | Minimum reviews | Logic |
|---|---|---|---|
| Stage 1 | 4.9+ | 3 | Effectively always gives 5 stars |
| Stage 2 | 4.7+ | 7 | High average sustained over 7+ reviews |
| Stage 3 | 4.5+ | 25 | Extremely high average even after 25 reviews |
The more reviews someone writes, the harder it is to maintain a very high average, which is why the thresholds step down by stage. In the data, a 4.9 average held over 3 reviews and a 4.5 average held over 25 reviews showed similar levels of bias, and the per-stage thresholds were derived from that observation.
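Because the three stages differ only in their thresholds, the detection rule translates directly into a short check; a reviewer matching any stage is flagged:

```python
def is_non_discerning(n_reviews: int, avg_rating: float) -> bool:
    """Three-stage detection of raters who score high almost everywhere."""
    return ((n_reviews >= 3 and avg_rating >= 4.9)       # Stage 1
            or (n_reviews >= 7 and avg_rating >= 4.7)    # Stage 2
            or (n_reviews >= 25 and avg_rating >= 4.5))  # Stage 3
```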
Among 81,679 venues, 7,507 (9.2%) turned out to have non-discerning reviews making up more than 40% of all their reviews. In other words, roughly one in ten venue ratings appears to be inflated by this kind of bias.
Not One Number, but Three Axes
This system does not produce a single score. It evaluates venues along three independent axes.
First, the weighted positive rate. This is the sum of weights for reviews rated 4 stars or higher divided by the sum of all weights. Rather than simply measuring the share of 4-star-and-up reviews, it looks at how much of that share comes from highly credible reviewers. The decision thresholds are set at 75% / 50% / 30% — 75% or above is “great eat,” 50% or above is “good,” 30% or above is “average,” and below that is “caution.”
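In code, this is a weight share rather than a head-count share. A minimal sketch with the stated thresholds (names are illustrative):

```python
def weighted_positive_rate(reviews) -> float:
    """reviews: (stars, weight) pairs for one venue."""
    reviews = list(reviews)
    total = sum(w for _, w in reviews)
    positive = sum(w for stars, w in reviews if stars >= 4)
    return positive / total if total else 0.0

def positive_verdict(rate: float) -> str:
    """Map the rate onto the 75% / 50% / 30% decision thresholds."""
    if rate >= 0.75:
        return "great eat"
    if rate >= 0.50:
        return "good"
    if rate >= 0.30:
        return "average"
    return "caution"
```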
Second, the bubble index. A venue is marked “caution” if the share of non-discerning reviews exceeds 40% or if the gap between the Kakao rating and the weighted Score exceeds 0.5 points. This axis is designed to detect places with high Kakao ratings but biased reviewer composition.
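Reading the gap as Kakao minus Score, consistent with the Gap column in the validation table below, the bubble check reduces to two conditions:

```python
def bubble_caution(non_discerning_share: float,
                   kakao_rating: float,
                   weighted_score: float) -> bool:
    """Caution if biased raters dominate or the rating looks inflated."""
    return (non_discerning_share > 0.40
            or (kakao_rating - weighted_score) > 0.5)
```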
Third, data reliability. This is judged from the effective sample size (Neff) and the number of Gold reviewers. For venues with at least 50 reviews, the average number of collected reviews is 103, but after weighting the effective sample shrinks to 76.2, an average compression ratio of 74.4%. In practical terms, out of 103 reviews only about 76 carry genuinely independent information.
Average 103 reviews → effective 76.2
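The Neff formula is not spelled out here; the standard choice for weighted samples is Kish’s effective sample size, which is consistent with 103 reviews compressing to about 76:

```python
def effective_sample_size(weights) -> float:
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    weights = list(weights)
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 else 0.0
```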
The most trustworthy great eats are the venues that perform well on all three axes at once — weighted positive rate of 75%+, bubble status “clean,” and data reliability “high.” If even one axis raises a warning, the reason is shown alongside it. This is a system where context, not a single number, forms the judgment.
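The exact combination logic is not published beyond “all three axes pass, otherwise show the reason”; one illustrative way to express it (the labels and signature are assumptions):

```python
def venue_verdict(positive_rate: float,
                  bubble_is_caution: bool,
                  reliability: str) -> str:
    """Combine the three axes; surface the reason when any axis warns."""
    warnings = []
    if positive_rate < 0.75:
        warnings.append(f"weighted positive rate {positive_rate:.0%}")
    if bubble_is_caution:
        warnings.append("bubble grade: caution")
    if reliability != "high":
        warnings.append(f"data reliability: {reliability}")
    return ("trustworthy great eat" if not warnings
            else "check: " + "; ".join(warnings))
```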
Validated on the Ground — 336 Venues in Songpa-gu
To check whether the formula matches reality on the ground, we reviewed the analysis results for 336 venues in Songpa-gu with at least 50 reviews each. The raw data were extracted from Kakao Map, covering each reviewer’s review count and average rating along with each venue’s rating distribution, and the same weighting formula was applied.
The Kakao rating and the weighted Score told noticeably different stories, and the size of the gap varied sharply by venue.
| Category · Area | Kakao | Score | Gap (Kakao − Score) | Gold | Bubble |
|---|---|---|---|---|---|
| Gopchang · Songpa | 4.3 | 3.23 | +1.07 | 24 | Caution |
| Hoe (raw fish) · Jamsil | 3.5 | 2.50 | +1.00 | 17 | Caution |
| Sashimi restaurant · Jamsil | 4.2 | 3.38 | +0.82 | 9 | Caution |
| Chinese · Songpa | 4.2 | 4.04 | +0.16 | 76 | Clean |
| Naengmyeon · Bangi | 3.9 | 3.85 | +0.05 | 100 | Clean |
| Western · Songpa | 4.2 | 4.20 | 0.00 | 64 | Clean |
| Sushi · Songpa | 4.8 | 4.82 | −0.02 | 9 | Clean |
| Shabu · Bangi | 4.5 | 4.42 | +0.08 | 90 | Clean |
Among the 336 venues, the most dramatic gap appeared at Gopchang · Songpa: a Kakao rating of 4.3 against a weighted Score of 3.23, a gap of +1.07. The share of non-discerning reviews was 32%, and its bubble grade was “caution.” The Kakao rating inflated the actual dining assessment by more than a full point.
The sashimi restaurant in Jamsil had a Kakao rating of 4.2, but the share of non-discerning reviews reached 64%. Two out of every three reviewers showed a pattern of giving high scores wherever they went. Its weighted Score was 3.38.
By contrast, the Western restaurant in Songpa, with the same Kakao rating of 4.2, had 64 Gold reviewers, a bubble grade of “clean,” and a weighted Score of 4.20, exactly matching Kakao. Shabu · Bangi, with one of the highest Gold reviewer counts among the 336 venues at 90, remained stable with a Score of 4.42 against a Kakao rating of 4.5.
Looking only at Kakao ratings, Gopchang · Songpa (4.3) and Western · Songpa (4.2) appear similar. The weighted system separates them by a 0.97-point gap.
Three Core Elements of the Decision Criteria
“Who gave this place its star rating?” Trace the history and patterns of those reviewers, and meaningful differences emerge between venues that look identical on Kakao Map. Refusing to read all 1.75 million reviews as if they carried the same weight: that is the starting point of this restaurant evaluation system, now applied to 93,515 venues.