
Introduction to Benchmarking Forecasters


Published 25th March, 2026

by Dr Keith Dear

Introduction

Forecasting is a field in which it is difficult to make comparisons, because…

The Superforecaster Median Forecast

Professor Tetlock’s work identified that some humans are reliably very good at predicting future events: much better than the average forecaster, and better even than the wisdom of crowds. Further, this effect can be increased by grouping Superforecasters together and harnessing their own wisdom-of-the-crowd effect. It is almost impossible for an individual human to reliably beat this level of forecasting ability, so it serves as something like a “Humanity’s Last Exam” for the forecasting world. Helpfully, ForecastBench shows what this effect looks like. If an omniscient being who could predict everything accurately would score ‘100’ (i.e. their Brier score is 0 because they are always right), then a group of Superforecasters would score ‘70’, the wisdom of the crowd would score ‘65’, and an individual average human would typically be around ‘50’. Cassi scores ‘68’; or to put it another way, Cassi is much better than the average person; significantly better than the wisdom of crowds; the best of the AI forecasting bots; like having a Superforecaster of your own*; and approaching the quality of forecast you would get from a whole team of Superforecasters.

*It is difficult to generate this figure precisely because forecasting platforms tend not to openly release forecast data for individual human forecasters, but our calculations place Cassi in the bottom third of human Superforecasters. In Metaculus’ recent forecasting competition with Bridgewater, Cassi performed in the top 2% of human forecasters, placing 46th out of 3,000.
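For readers who want to see the mechanics, the sketch below shows how a Brier score is computed for binary questions, together with one purely illustrative rescaling. The rescaling is our assumption for this article, not ForecastBench’s published formula; it simply happens to map a perfect Brier score of 0 to 100 and a coin-flip Brier score of 0.25 to 50, consistent with the figures above.

```python
# Minimal sketch: Brier scores for binary questions, plus one illustrative
# 0-100 rescaling (our assumption, not ForecastBench's actual formula).

def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and outcomes (1 = happened, 0 = did not)."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def illustrative_scaled_score(brier):
    """Maps Brier 0 (omniscient) to 100 and Brier 0.25 (a coin flip on every question) to 50."""
    return 100 * (1 - 2 * brier)

# Three toy forecasters on the same five questions
outcomes = [1, 0, 1, 1, 0]
forecasters = {
    "omniscient": [1.0, 0.0, 1.0, 1.0, 0.0],
    "coin_flip":  [0.5, 0.5, 0.5, 0.5, 0.5],
    "skilled":    [0.9, 0.2, 0.8, 0.7, 0.1],
}

for name, probs in forecasters.items():
    b = brier_score(probs, outcomes)
    print(f"{name:10s} Brier={b:.3f} scaled={illustrative_scaled_score(b):.0f}")
```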

Understanding Forecasting Competitions

The various tournaments and competitions that Cassi participates in all have distinctive features that make comparisons across them very difficult (which is why we use a “percentage better than” metric).

Forecasting scores in any competition are related to, in rough order:

  1. How difficult the questions are.
  2. How often a forecaster is permitted to forecast on a given question.
  3. How much forecasting on more questions is beneficial.
  4. How skilful the forecaster is.

Some questions are more difficult than others. For Polymarket and Kalshi (and to a degree Metaculus), where the users propose the questions, there is nothing preventing those questions from being very easy indeed, and they sometimes are. Conversely, questions are sometimes chosen deliberately to be challenging and obscure (ForecastBench and Metaculus). A more accurate forecast on the latter, even if in absolute terms much less accurate than an accurate forecast on the former, is much more difficult to achieve and much more valuable; but in strict terms it is “a worse score”. Both Metaculus (by scoring relative to everyone else, rather than on an absolute accuracy scale) and ForecastBench (by weighting questions according to how difficult they proved when computing the score) make this adjustment internally within their platforms.
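To give a feel for the second kind of adjustment (ForecastBench’s actual weighting scheme is its own; the difficulty proxy below is simply an assumption for illustration), one can weight each question by how hard the crowd found it, so that performance on hard questions dominates the average:

```python
# Illustrative sketch only: weight each question by how hard the crowd found it,
# using the crowd's own Brier error as a rough difficulty proxy. This is the
# general idea of a difficulty adjustment, not ForecastBench's formula.

def brier(p, outcome):
    return (p - outcome) ** 2

# (forecaster probability, crowd probability, outcome) for four questions
questions = [
    (0.95, 0.97, 1),   # easy: everyone was nearly sure, and it happened
    (0.80, 0.55, 1),   # hard: crowd was unsure, forecaster did well
    (0.30, 0.50, 0),   # hard: crowd was unsure, forecaster did well
    (0.10, 0.05, 0),   # easy
]

weighted_total, weight_sum, unweighted_total = 0.0, 0.0, 0.0
for p_forecaster, p_crowd, outcome in questions:
    difficulty = brier(p_crowd, outcome)          # harder question -> larger crowd error
    weighted_total += difficulty * brier(p_forecaster, outcome)
    weight_sum += difficulty
    unweighted_total += brier(p_forecaster, outcome)

print("unweighted mean Brier:         ", unweighted_total / len(questions))
print("difficulty-weighted mean Brier:", weighted_total / weight_sum)
```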

Some tournaments are set up so that a forecaster may only make a single forecast. Most of Metaculus’ AI-only forecasting competitions are like this, and so are the annual ACX competition and others. Forecasters are then compared on how accurate they are as of the same date. But in many tournaments, and usually in real-world forecasting, forecasters are allowed to forecast as often as they like. Scores in the latter are thus a lot better, because one of the keys to accurate forecasting is updating appropriately on new information: or to put it plainly, it is a lot easier to predict what will happen in December when you are in November than it was back in January. However, having a forecast 12 months out that is somewhat more accurate than everyone else’s is more impressive, and will often be more useful, than a more accurate forecast 1 month out that everyone else shares. Again, there is no really fair comparison between such different types of tournament when one is simply a lot easier than the other.

Some competitions and comparisons are based on cumulative points scored, some on accuracy provided a given number of questions have been forecast. It is roughly the same principle as batting averages in cricket and baseball: is the best player the one with the highest average, assuming a non-trivial number of innings or at-bats? Or is the best player the one who scores or otherwise creates the largest absolute number of runs? How much are you rewarding industry over efficiency?

AI forecasters, for obvious reasons, could win competitions hands-down on industry if that were the only thing rewarded. And over limited sets of questions, some humans can be motivated to forecast a lot too, to improve their score.

Accuracy is also not necessarily the most important thing; otherwise, the forecasters who picked only the most certain questions to forecast on would be rewarded for their cherry-picking.

So overall, there is no simple way to compare one forecaster with another over different sets of questions. If you want to find the most accurate forecasters, look at who does consistently well compared to other forecasters in the same competition or over the same set of questions, and prefer large numbers of questions to small.

Dataset & Market Questions on ForecastBench

‘Dataset’ questions are those for which there is no reference forecast anywhere: every forecaster has to make their own best estimate. This happens when a question has been set that no other forecasting platform, prediction market or bookmaker has asked. This is a lot harder than ‘Market’ questions, where one or more such reference forecasts do exist. Essentially, it means a forecaster can anchor their own forecast on the existing ‘wisdom of crowds’ and only offer a different forecast if they believe, for some reason, that they know better. It is also possible that ‘Market’ questions are generally less obscure. Between these two factors, forecasting on ‘Market’ questions is usually much easier than forecasting on ‘Dataset’ questions. For that reason, we at Cassi are also proud of how our forecasting AI that does not look at reference forecasts is doing on ForecastBench (currently 17th out of 265, as at 11 Mar 26).

Benchmarking at Cassi

At Cassi, we compare ourselves using percentages derived from Metaculus’ Peer Score (which is based on the log-loss scoring rule) and ForecastBench’s difficulty-adjusted Brier score (see ‘Scoring rule’ on Wikipedia for both). The scoring rules themselves are a little involved, but they can be simplified to show how much better or worse a forecast was than the median forecaster on Metaculus or ForecastBench.
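As a rough sketch of the idea behind a peer-style comparison (Metaculus’ actual Peer Score definition includes time-averaging over the life of a question and its own scaling, so the simplifications and function names below are ours): score each forecaster on a question by log loss, then report how much better or worse each forecaster’s log score is than the average of the others’.

```python
import math

# Simplified sketch of a log-score-based peer comparison on one binary question.
# Metaculus' real Peer Score averages over the lifetime of a question and applies
# its own scaling; this only shows the core idea.

def log_score(p, outcome):
    """Natural-log score of probability p for a binary outcome (1 = happened, 0 = did not)."""
    return math.log(p) if outcome == 1 else math.log(1 - p)

def peer_scores(probabilities, outcome):
    """Each forecaster's log score minus the mean log score of the other forecasters."""
    scores = [log_score(p, outcome) for p in probabilities]
    results = []
    for i, s in enumerate(scores):
        others = [x for j, x in enumerate(scores) if j != i]
        results.append(s - sum(others) / len(others))
    return results

# Four forecasters give probabilities for an event that then happened (outcome = 1)
probs = [0.9, 0.7, 0.5, 0.3]
for p, peer in zip(probs, peer_scores(probs, outcome=1)):
    print(f"p={p:.1f} peer score={peer:+.3f}")
```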

The reason for these adjustments is that some questions are much easier than others. For example, predicting whether the Earth will still exist tomorrow is easier than predicting what the FTSE 100 will be in 10 years’ time. This makes naive comparisons of scoring and accuracy deceptive: for comparison purposes, what matters is how one forecaster performed compared with another on a similar set of questions. You can’t really compare accuracy between different sets of questions; it would be like comparing apples to orangutans. So we calculate the score that Cassi achieves over the sets of questions and tournaments it forecasts on and compare it to the reference groups already mentioned. The percentage we give is how much better or worse that score is, normalised so that every question across all the platforms and tournaments carries equal weight.
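The sketch below illustrates the shape of that calculation rather than our exact internal pipeline, which is simplified here: compare our Brier score with the median forecaster’s Brier score question by question, then average those per-question comparisons with every question carrying equal weight, whichever platform it came from.

```python
from statistics import median

# Illustrative only: the precise way we combine scores is simplified here.
# Per question: how much better was our Brier score than the median forecaster's?
# Then average with equal weight per question, whatever the platform.

def brier(p, outcome):
    return (p - outcome) ** 2

# (our probability, other forecasters' probabilities, outcome)
questions = [
    (0.85, [0.6, 0.7, 0.9, 0.5], 1),    # e.g. a Metaculus question
    (0.20, [0.4, 0.3, 0.5, 0.6], 0),    # e.g. a ForecastBench question
    (0.70, [0.65, 0.8, 0.75, 0.6], 1),  # e.g. a market-style question
]

relative_improvements = []
for ours, crowd, outcome in questions:
    median_brier = median(brier(p, outcome) for p in crowd)
    our_brier = brier(ours, outcome)
    # positive = lower error than the median forecaster on this question
    relative_improvements.append((median_brier - our_brier) / median_brier)

average = 100 * sum(relative_improvements) / len(relative_improvements)
print(f"average % better than the median forecaster: {average:.0f}%")
```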

The Average Tournament Forecaster

The average tournament forecaster is the median (fiftieth-percentile) performer in a tournament. We assume this is the closest figure to what an ‘average adult human’ would get, although there is probably a selection-bias effect, such that people whose hobby is forecasting could well be a bit better than average, and a truly random sample would be somewhat worse. Furthermore, we know that receiving feedback on your forecasts improves your forecasting accuracy and calibration, so we can be reasonably confident that those whose hobby is forecasting are not a representative sample, and that our performance is likely further ahead of most forecasters, in most organisations, than the cautious headline result here shows. The effect is potentially exacerbated by the fact that the average forecaster does not forecast on every question; they naturally forecast more in areas they are interested in, have some expertise in, or both. It is a reasonable proxy for a human who has expertise in a field but is not practised in, selected for, or otherwise good at forecasting. This is deduced from Professor Tetlock’s findings in Expert Political Judgement and Superforecasting that the average expert was not much better than anyone else at predicting uncertain geopolitical events.

The Wisdom of Crowds

When aggregated together, the forecasts of a crowd are much more accurate than those of the majority of its individual members (see ‘Wisdom of the crowd’ on Wikipedia). Forecasting platforms (and prediction markets, and betting sites) use this property to produce reliably good forecasts about the future, forecasts that are hard to reliably improve upon. This level of accuracy is roughly what an organisation could hope to achieve if it devoted effort to collecting the forecasts of a significant number of its members. Regularly beating this number is a really good test for both human and AI forecasters: it implies that if these forecasts were bets, the forecaster would beat the bookies. This is what participants on prediction markets like Polymarket and Kalshi are trying to achieve, and conversely the best possible crowd forecast is what the platform itself is trying to achieve.
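A toy simulation (with entirely made-up numbers) gives a feel for why this works: take the median of many noisy individual probability forecasts, and the aggregate’s Brier score typically beats most, and often nearly all, of the individuals.

```python
import random
from statistics import median

# Toy simulation of the wisdom-of-the-crowd effect: the median of many noisy
# forecasts is usually more accurate (lower Brier score) than most individuals.
random.seed(0)

NUM_QUESTIONS, NUM_FORECASTERS = 200, 25
true_probs = [random.uniform(0.05, 0.95) for _ in range(NUM_QUESTIONS)]
outcomes = [1 if random.random() < p else 0 for p in true_probs]

def clamp(p):
    return min(0.99, max(0.01, p))

# Each forecaster sees the true probability plus their own noise
individual_forecasts = [
    [clamp(p + random.gauss(0, 0.2)) for p in true_probs]
    for _ in range(NUM_FORECASTERS)
]
crowd_forecasts = [median(f[q] for f in individual_forecasts) for q in range(NUM_QUESTIONS)]

def brier(forecasts):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / NUM_QUESTIONS

individual_briers = sorted(brier(f) for f in individual_forecasts)
print(f"best individual  : {individual_briers[0]:.3f}")
print(f"median individual: {individual_briers[NUM_FORECASTERS // 2]:.3f}")
print(f"crowd (median)   : {brier(crowd_forecasts):.3f}")
```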

The Wisdom of the AI Crowd

Different AI forecasters can be counted as a crowd just as humans can, and the aggregate reliably produces more accurate forecasts (and some AI forecasters are effectively aggregates already, since their final forecast is derived from the mean of an ensemble of model forecasts). Beating this with a single AI forecaster is very difficult.