Legibility at scale: why hiring is broken
How do people develop valuable skills, and how do employers discover who has those skills? These questions have run through society for a long time. Ever since humans began to organize themselves into tasks and roles, and economies and cities grew out of that organization, the problem of allocating human capital has existed.
We can distill it to a single question: “Who can I trust to hire?”
Villages solved it through direct observation. Guilds solved it through apprenticeships. Professional networks solve it through reputation. The form changes, but the problem persists.
Information overload and the rise of shortcuts
With language, data, artifacts, and résumés, information began to scale. At first that scaling was gradual; then it became non-linear. Today the problem is not a lack of information, but too much of it.
“Reality feeds the participants so much information that they need to introduce dichotomies and other simplifying devices to make sense of it. The simplest way to introduce order is binary division: hence, the tendency to use dichotomies.” — George Soros
Our systems are not built to handle this volume, so humans resort to mental shortcuts and cognitive biases. Before gen-AI, we rarely had a practical, scalable way to translate and combine all that information into a process that reliably selects the candidates with the highest probability of success. Or, in my view, the candidates with the lowest probability of getting someone in HR fired.
Emergence versus centralization
Reputation, trust, and skill verification are emergent properties. They arise from repeated interactions and social proof, not from any central authority. But the current system tries to centralize them: universities, HR departments, LinkedIn endorsements that mean nothing.
That is why most solutions optimize for shortcuts that are, on average, high signal: universities (credentials), résumés (self-report), interviews (verification), and centralized quality filters like LinkedIn, applicant tracking systems (ATS), and keyword screening.
Organizations are optimizing for throughput under uncertainty, so they adopt low-cost proxies that are gameable and that systematically exclude high-skill but low-legibility candidates. (By legibility, I mean the ease with which a system can translate evidence of capability into a decision. By throughput, I mean how many candidates a hiring process can evaluate per unit time and cost.)
The problem is not centralization per se. Centralized measurement can work: structured interviews, work samples, validated assessments. The problem is centralized proxy filters that optimize for throughput over accuracy: keyword screening, credential requirements, pedigree matching. These scale efficiently, but measure the wrong thing.
Why the system is breaking
This system is breaking, and for companies hiring for outliers it just does not work. Three forces are driving the breakdown.
First, give humans enough information and the right incentives, and the game will be gamed. The market is full of websites, tutors, AIs, and other tools helping candidates build résumés, write cover letters, rehearse for interviews, and look like the best candidate. Everyone thinks they are solving a complicated system when in fact it is a complex one. The incentive is meant to surface the best candidates, but you end up hiring the ones who are most rehearsed, who wear the mask best.
Second, when you hire for the best talent, the average does not work. You want the talent that does not fit the distribution, that looks like an anomaly. Some people win life’s lottery and score a 10 on every dimension, but typically you see peaks and valleys: very high on one dimension, low or middling on others. Karim from Ramp has a good explanation of this: hire for spikiness. Beyond that, talent and personality only reveal themselves over time.
Third, the environment shifted. Credentials signal less (degree inflation, online education, career switching). Skills evolve faster than institutions can certify them. And AI risks becoming an accelerant of the worst version of the current system.
AI can centralize judgment into an algorithm instead of distributing it across a network, unless it is used to scale distributed assessment: synthesizing references, evaluating work samples, aggregating diverse signals. The risk is that most companies will use AI to optimize the broken system (better keyword screening) rather than replace it with better measurement.
Two kinds of emergence we keep confusing
To understand why this is failing, we need to separate two emergent phenomena that current solutions conflate.
Emergent Phenomenon A: How reputation forms (distributed, through repeated interactions and social proof).
Emergent Phenomenon B: How hiring decisions get made at scale (pattern matching on credentials because emergence doesn’t scale).
Problem A: “Candidates that are less legible but have real capabilities and reputation don’t translate into the hiring process.”
Problem B: “Companies can’t actually assess capability at scale, so they use proxies.”
These seem like separate problems, one about reputation formation and one about hiring at scale. But they collapse into a single issue: legibility. Candidates cannot make their capabilities legible. Companies cannot make sense of what is legible.
Why slow hiring works (when it works)
The best hiring (small teams, extensive reference checks, heavy involvement from the team itself) works because it lets emergent reputation surface through distributed social proof. Nobody certifies that someone is great; it emerges from the network’s collective assessment.
The hiring process is long, not short. The candidate speaks with multiple team members, each of whom brings a different lens and a different set of questions. Over time, the candidate also evolves. You collect data points across a time series that is spread out enough, and enriched with deeper explanatory variables, to move from a simplistic linear picture to a richer non-linear one. You build, in effect, an internal hiring neural network.
That internal network is strengthened because each team member can recruit their own external network to surface information that is hidden and excluded from the market’s assessment of a candidate.
This method of slow hiring with external validators works because there is less translation leakage. Recommendation happens through a medium that allows extended exchange: conversations, emails, text messages, without a forced timeline. You go past status symbols and credentials. You understand who the person is, how they work, where they succeeded and failed.
From these interactions, a distinct reputation emerges. The best recommenders surface and gain credibility. This is still underdeveloped in modern systems: we live in a society where the best readers and recommenders of talent are often hidden. But the more they recommend, and the more successful those recommendations prove to be, the more their status increases and the more trust emerges. Reciprocity likely emerges too: those who were helped want to give back.
What also emerges is information aggregation without central authority, something close to the candidate’s true value. Before, you knew the role you were hiring for, your willingness to pay, and who you were hiring. You did not know whether it was a bargain or overpriced. You also did not know whether the person was a growth stock, with increasing returns to scale, or someone who would need more resource allocation to unlock their talent.
In summary, it works because:
- It is distributed (multiple independent assessments).
- It has feedback loops (validators stake reputation and learn over time who is credible).
- It compounds information (each conversation adds context others do not have).
- It is anti-fragile to gaming (no single filter to optimize for).
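To make those four properties concrete, here is a minimal sketch in Python of how distributed, time-extended assessment could be aggregated. Every name, score, and weight in it is invented for illustration; it is a toy model of the mechanism, not a description of any real process.

```python
# Toy model of distributed, time-extended candidate assessment.
# All names and numbers are invented; this is illustrative only.
from dataclasses import dataclass


@dataclass
class Evaluator:
    name: str
    credibility: float = 1.0  # weight earned from past recommendation outcomes


@dataclass
class Assessment:
    evaluator: Evaluator
    score: float   # 0..1 judgment of the candidate
    context: str   # what the evaluator actually observed
    month: int     # when in the process the signal arrived


def aggregate(assessments: list[Assessment]) -> float:
    """Credibility-weighted average: no single gatekeeper, each signal counts
    in proportion to the evaluator's track record."""
    total = sum(a.evaluator.credibility for a in assessments)
    return sum(a.score * a.evaluator.credibility for a in assessments) / total


def settle(assessments: list[Assessment], hire_succeeded: bool) -> None:
    """Feedback loop: once the outcome is observed, evaluators who called it
    correctly gain credibility and those who missed lose some."""
    for a in assessments:
        correct = (a.score >= 0.5) == hire_succeeded
        a.evaluator.credibility *= 1.1 if correct else 0.9


# Usage: three independent lenses on the same candidate, collected over months.
alice, bob, carol = Evaluator("alice"), Evaluator("bob"), Evaluator("carol")
signals = [
    Assessment(alice, 0.9, "worked with them on a launch", month=0),
    Assessment(bob, 0.6, "technical screen", month=1),
    Assessment(carol, 0.8, "backchannel reference", month=3),
]
print(round(aggregate(signals), 2))   # richer than any single snapshot
settle(signals, hire_succeeded=True)  # credibility compounds across hires
```

The point of the settle step is the feedback loop: evaluators who are repeatedly right count for more in the next aggregation, which is exactly what a one-shot résumé screen cannot do.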
This model works best when performance is high-variance, context-rich, and observable over a reasonable horizon, and when a bad hire is expensive enough to justify depth. It fails when roles are commoditized, turnover is expected, or output is hard to attribute, and when the organization cannot afford the time and attention required for compounding evaluation.
This method has real advantages: distributed assessment, compounding information, resistance to gaming. But it is not perfect. Networks encode bias. A tight-knit group can be collectively confident and collectively wrong, especially if they share backgrounds, values, or blind spots. The question is not whether networks are always right. The question is whether they can be more right more often than credential proxies, and under what conditions.
Why LinkedIn cannot produce real reputation
So why can’t this work in a network like LinkedIn?
First, information is scattered. There is no coherent source of truth; each node has its own opinion and experience. To benefit from that collective knowledge and make a candidate legible, you need an integration layer, a central API that connects to many systems of record. How do you tap into all of them in a way that is trusted?
Second, networks and recommenders derive value from status and reputation. In that sense, networks are exclusionary by design. You enter through proof of work and proof of outcome. Each person needs a concrete reason: “I trust them because …” Trust is built through exclusion initially.
Third, LinkedIn endorsements are meaningless:
- No verification of outcomes (did the person actually succeed in the role?).
- No cost to endorse (I can endorse 100 people with one click).
- No reputation stake (my bad recommendations do not hurt me).
- Gaming is trivial (quid pro quo endorsements).
- LinkedIn rarely observes the thing that matters: did the person actually perform in the role, over time, relative to expectations.
Why the current system persists anyway
The current system persists not because it is good, but because the alternatives do not scale.
- Small-team hiring with deep reference checks does not scale. You cannot hire 100 people a year this way.
- Emergent reputation through networks excludes outsiders. If you are not in the network, you are invisible. This is how old boys’ clubs perpetuate themselves.
- Credentials (degrees, job titles, company logos) scale efficiently. You can screen 1,000 résumés in an hour.
- High-status recommenders already capture the upside inside their private networks. They help friends and people they care about get jobs (social capital). They help companies they advise find talent (relationship capital). They build reputation in their immediate network (status).
The system is solving for throughput and cheap legibility, not for accuracy, and not for the small AAA teams that need outliers.
What would good look like?
A system that improves accuracy and fairness usually pays for it with time and cost. The only credible way out is to reduce the marginal cost of measurement without collapsing the process into a single opaque score that becomes the new proxy.
If we wanted to build a system that lets emergent reputation surface at scale, it would probably need three things.
Signal production: candidates need ways to demonstrate capability. Public work artifacts (GitHub, portfolios, writing, projects). Verified outcomes (provable results from past roles). Performance on standardized tasks (work samples, case studies).
Signal aggregation: the system needs to combine multiple independent assessments. Not one gatekeeper (HR, ATS), but distributed evaluation. Multiple evaluators with different perspectives. Time-series data, not just a snapshot. Cross-network validation, not just your friends vouching for you.
Incentives and accountability: recommenders need skin in the game. If your recommendations succeed, you gain credibility. If they fail, you lose it. Track record becomes visible, not just endorsements. Costs for bad recommendations (reputation damage, possibly financial). Rewards for good recommendations (status, access, possibly monetary).
None of these are easy. Each has hard tradeoffs. But together, they point toward a different architecture.
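As a rough sketch of what that architecture’s data model might look like, here is a toy version in Python. Every type and field name in it (Signal, Recommender, CandidateFile) is hypothetical; it shows the shape of the three layers, not a proposed implementation.

```python
# Toy data model for the three layers above: signal production, signal
# aggregation, and recommender accountability. Names are hypothetical.
from dataclasses import dataclass, field
from enum import Enum


class SignalKind(Enum):                    # layer 1: signal production
    ARTIFACT = "public work artifact"      # repo, portfolio, writing
    VERIFIED_OUTCOME = "verified outcome"  # provable result from a past role
    WORK_SAMPLE = "work sample"            # standardized task or case study


@dataclass
class Signal:
    kind: SignalKind
    description: str


@dataclass
class Recommender:                         # layer 3: incentives and accountability
    name: str
    successes: int = 0
    failures: int = 0

    @property
    def track_record(self) -> float:
        """Visible hit rate; this is the reputation the recommender stakes."""
        total = self.successes + self.failures
        return self.successes / total if total else 0.5  # uninformative prior

    def record_outcome(self, succeeded: bool) -> None:
        if succeeded:
            self.successes += 1
        else:
            self.failures += 1


@dataclass
class CandidateFile:                       # layer 2: signal aggregation
    signals: list[Signal] = field(default_factory=list)
    vouches: list[tuple[Recommender, float]] = field(default_factory=list)

    def score(self) -> float:
        """Combine vouches weighted by each recommender's visible track record,
        rather than trusting a single gatekeeper or credential."""
        weights = [r.track_record for r, _ in self.vouches]
        total = sum(weights)
        if not total:
            return 0.0
        return sum(w * s for (_, s), w in zip(self.vouches, weights)) / total
```

The design choice that matters is that the recommender’s track record is both visible and consequential: it is the weight their vouch carries and the reputation they stake on it.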
Rejecting the steelman
A skeptical reader might say: “The system is not broken. It is doing what it was designed to do: minimize organizational risk and transaction costs. Outliers are not systematically missed; they are not worth the assessment cost at scale. The rare teams that need outliers already run bespoke loops.”
That may be true under current constraints. But constraints are not laws of physics. They are often habits, incentives, and default tooling. The status quo holds until someone builds a system that changes the cost curve of measurement enough that accuracy and throughput stop being in such direct tension.
What this is and is not
This essay is exploratory, not empirical. I am trying to simplify the problem statement and re-analyze it, not prove a solution. The claims here are hypotheses worth testing, not settled facts.