

Software developer and data geek with 18+ years delivering web, mobile, and defense systems that ship to production. My focus is on analytics platforms, APIs, and developer tooling. My open-source work ranges from compilers and automation frameworks to GIS data products. I weave AI-assisted workflows into day-to-day engineering to accelerate delivery and quality.
Over a decade ago, I was watching an episode of Brain Games with my girlfriend at the time. The show ran all sorts of psychology experiments on unsuspecting participants, but one stuck with me. They put a gumball machine in front of about 20 people and asked each one to guess how many gumballs were inside. The guesses were all over the place: some way too high, some way too low, none particularly close. But when they averaged all 20 guesses together, the group landed within 22 of the actual count of 2,447. That's 99% accuracy from a crowd of people who had no idea what they were doing.
The concept has a name: "wisdom of the crowds," and it turns out the Brain Games demo was a lightweight version of experiments that go back over a century. In 1906, the statistician Francis Galton collected 787 guesses at a country fair where people tried to estimate the weight of a slaughtered ox. The crowd included butchers, farmers, livestock experts, and random fairgoers. Individual guesses varied wildly. But the average of all 787 guesses was 1,197 pounds. The actual weight was 1,198. Off by one pound, and more accurate than any individual expert at the fair, including the cattle specialists. A century later, Michael Mauboussin ran the same experiment with 73 Columbia Business School students and a jar of 1,116 jelly beans. Guesses ranged from 250 to 4,100. The group average landed at 1,151, just 3% off, and only 2 out of 73 students managed to beat it individually. The crowd outperformed 97% of the people in it.
The mechanism is elegant: individual errors are random and tend to point in different directions, so they cancel out when averaged. One person overshoots, another undershoots, and the noise washes away, leaving the signal. No single person needs to be right. The crowd converges on the right answer because errors are independent and unbiased. I had already relied on this concept in programming via Monte Carlo simulations but never put the two together. The idea behind the Monte Carlo approach is to take random samples to approximate an answer cheaply instead of performing an expensive calculation to obtain an exact solution.
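The cancellation is easy to see in a few lines of Python. This is a toy sketch, not a replay of either experiment: it assumes every guesser is unbiased with normally distributed error (the standard deviation of 400 is made up), which is precisely the condition that makes the averaging work.

```python
import random

# Toy sketch of the jelly bean effect: a crowd of noisy but unbiased
# guessers, averaged. The jar size matches Mauboussin's 1,116 beans;
# the noise level is an assumption for illustration.
TRUE_COUNT = 1116
CROWD_SIZE = 73

random.seed(42)
guesses = [random.gauss(TRUE_COUNT, 400) for _ in range(CROWD_SIZE)]

crowd_average = sum(guesses) / len(guesses)
mean_individual_error = sum(abs(g - TRUE_COUNT) for g in guesses) / len(guesses)

print(f"crowd average:         {crowd_average:.0f}")
print(f"crowd error:           {abs(crowd_average - TRUE_COUNT):.0f}")
print(f"mean individual error: {mean_individual_error:.0f}")
# The crowd's error comes out at a fraction of the typical individual error,
# shrinking roughly with the square root of the crowd size -- the same
# cancellation that Monte Carlo methods lean on.
```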
Wisdom of the crowds is a lot more common than it initially seems: Google ranks pages by how many other pages link to them, Reddit surfaces answers by collective upvote, and Rotten Tomatoes' aggregate scores are more trusted than any single critic.
Generative AI is running the jelly bean experiment at a scale Galton couldn't have imagined. Instead of averaging 787 guesses about an ox, an LLM is trained on the collective written output of millions of people: their code, their arguments, their explanations, their opinions. When it answers a question, it's not reasoning from first principles. It's returning the statistical consensus of its training data, the answer that the crowd of humanity converged on. Ask it to explain photosynthesis and it gives a better explanation than most biology teachers could improvise, because it has effectively averaged every explanation of photosynthesis ever written and distilled the signal from the noise.
This is genuinely impressive, and I think most people underestimate what's happening when they interact with these models. The technology isn't doing anything mystical. It's applying the same principle that made Galton's crowd of random fairgoers more accurate than expert cattle judges: aggregate enough independent perspectives and the errors cancel out. The difference is scale. Instead of 787 people guessing at one question, it's billions of data points across every domain humans have written about. The result is a system that gives the right answer to most questions, most of the time, with a fluency that makes it feel like it actually understands what it's saying.
For roughly 80% of problems, the crowd has converged on the right answer and the output is solid. Boilerplate code, standard explanations, common debugging patterns, well-established concepts: the training data contains thousands of examples done correctly, and the averaged result is better than what most individuals would produce under time pressure. This is the jelly bean effect at work. Enough people have written correct implementations of a binary search, or explained the difference between TCP and UDP, or written a standard SQL query, that the model's averaged output lands close to the truth.
This is where generative AI earns its reputation. Tasks that used to take hours can be done in minutes. Rough drafts materialize on command. Code that used to require hunting through Stack Overflow gets generated inline. The crowd has already solved these problems, and the model is surfacing that solution.
The other 20% of the time, the jelly bean effect breaks down. The failures cluster around three specific conditions, and each one maps directly back to the assumptions that make crowd wisdom work in the first place.
The crowd blends without common sense. In 2023, AI-generated mushroom foraging books started appearing on Amazon. They looked like legitimate field guides: clean layouts, confident descriptions, professional tone. But they misidentified species and, in some cases, suggested tasting as an identification method for mushrooms that could kill. The New York Mycological Society issued a public warning. The model had averaged food bloggers, nature writers, and amateur foraging content into something that had the form of a field guide without understanding that getting a mushroom wrong is potentially fatal. This is the AI equivalent of averaging a recipe from 100 cooks and ending up with ingredient ratios no actual cook would use: each input is sensible on its own, but the blend is nonsensical because the model has no understanding of context or stakes.
The crowd was wrong in the same direction. Galton's experiment works because individual errors are independent: one person guesses too high, another too low, and they cancel. But when the crowd is systematically biased, averaging amplifies the error instead of correcting it. A 2023 study at Long Island University tested ChatGPT with real drug questions posed to a pharmacy information service: pharmacology experts judged 75% of the responses incomplete or outright wrong. In one case, researchers asked about an interaction between Paxlovid (a COVID antiviral) and verapamil (a blood pressure medication). ChatGPT said no interactions had been reported. In reality, taking them together can cause dangerous blood pressure drops. The crowd's medical knowledge skews shallow because the actual pharmacological nuance lives in specialist literature and behind paywalls, underrepresented in training data. The public-facing content that dominates the training set is consumer-grade health advice, and the model reflects exactly that bias.
The world changed and the pattern broke. The crowd's wisdom is historical. It reflects what was true when the training data was generated, not what's true now. Developers felt this acutely when OpenAI released version 1.0 of their Python library in November 2023, replacing the old openai.ChatCompletion.create() pattern with a completely incompatible client-based API. For months afterward, ChatGPT kept generating the old v0-style code because the training data was overwhelmingly pre-November 2023. Worse, it would sometimes blend v0 imports with v1 call patterns in the same file, producing code that was internally contradictory. A human developer working from outdated Stack Overflow answers would at least keep their approach internally consistent while they figured out which version was current. The model didn't have that awareness; it just knew both patterns were statistically associated with "OpenAI Python code." The irony of ChatGPT not being able to keep up with its own creator's API changes was not lost on the developer community.
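For anyone who didn't live through that migration, the break looked roughly like this. The snippet below is a simplified sketch (the model name and prompt are placeholders): the pre-1.0 pattern the training data was full of, next to the client-based v1 call that replaced it.

```python
# The v0-style pattern that dominated the training data (pre-November 2023):
#
#   import openai
#   openai.api_key = "sk-..."
#   response = openai.ChatCompletion.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user", "content": "Hello"}],
#   )
#   print(response["choices"][0]["message"]["content"])

# The v1 client-based equivalent:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

# The failure mode wasn't just defaulting to the old block above; it was
# blending the two: v0 imports and dictionary-style response access stitched
# onto v1 calls, code no human wrote in either era.
```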
The 18th-century mathematician Condorcet proved the math behind this formally: if each person in a group is more likely right than wrong, the group's accuracy approaches 100% as it grows. But flip the condition, let individuals be biased, misinformed, or working from stale data, and group accuracy approaches zero. The crowd amplifies whatever direction it's already leaning. When it's leaning right, AI looks brilliant. When it's leaning wrong, AI delivers that wrong answer with the full confidence of a thousand voices agreeing with each other.
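The theorem is simple enough to check numerically. Here is a small sketch under the theorem's idealized assumptions: independent voters, each correct with the same probability p, decided by simple majority.

```python
import math

def majority_correct(p: float, n: int) -> float:
    """Probability that a simple majority of n independent voters is right,
    when each voter is right with probability p (n odd, so no ties)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

for p in (0.6, 0.4):
    for n in (1, 11, 101, 1001):
        print(f"p={p}, n={n:>4}: {majority_correct(p, n):.3f}")
# p=0.6 climbs toward 1.0 as the group grows; p=0.4 collapses toward 0.0.
# Aggregation amplifies whichever side of 50% the individuals are on.
```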
The jelly bean experiment has a hidden lesson that most people miss. The crowd's average was better than 97% of the individuals in the room, but it wasn't magic. It worked because the errors were independent and the question had a definite answer. Change either condition and the wisdom evaporates. Ask a crowd to predict something genuinely novel, something nobody in the room has experience with, and the average is just averaged ignorance.
AI operates under the same constraints. It's spectacular at questions where the crowd has converged on a correct answer, and unreliable at questions where the crowd was biased, outdated, or simply hadn't encountered the problem before. The skill that matters isn't knowing how to use AI. It's knowing when to trust it and when your own expertise is telling you something the crowd missed. That instinct, the ability to sense when an answer sits in the uncanny valley of plausible-but-wrong and to know which side to land on, is what separates someone who uses AI as a tool from someone who gets used by it.