Hey, wait – is employee performance really Gaussian distributed??
A data scientist’s perspective
tl;dr:
It’s probably Pareto-distributed, not Gaussian, which explains several of the problems that performance management processes have at large corporations, and also speaks to why it’s so hard to hire good people. Oh, and for the economists: the Marginal Productivity Theory of Wages is cleverly combined with the Gini Coefficient to arrive at the key insight.
Ahhh, it’s Q4 at Fortune 500 companies. That means it’s performance management season, when millions of employees are ranked and graded on their accomplishments over the past 12 months. Their bonuses and raises next year – or lack thereof, accompanied by a statement about their future at the company – depend on the outcome of this process.
I’ll graciously sidestep the discussion of how accurate these Fortune 500 managers could possibly be as they assess and rank their direct reports. (Accuracy here is certainly questionable, given issues such as apples-to-oranges comparisons, the principal-agent problem, the many flavors of bias, etc.). Similarly, I’ll also sidestep the negative effects that these performance management processes have on a company’s culture and work environment (don’t get me started!). Let’s instead focus on the statistical foundations of the process through the lens of a data scientist. Spoiler alert: it’s built on shaky ground.
As a quick reminder of the process at a typical Fortune 500 company, managers will annually hold a series of meetings, with the express purpose of slotting all employees into place on a “vitality curve” that looks something like this:
Dividing lines are drawn that vary from company to company, and from year to year. The categories above reflect General Electric's original 20-70-10 split from the early 1980s, which is often cited as the canonical system: the top 20% of employees are given a handsome monetary bonus, the middle 70% are retained, and the bottom 10% are shown the door.
Sometimes I like to visualize the same data using a percentile chart instead – you can think of it as literally lining everyone up from shortest to tallest; each person’s “percentile” is the percent of the population that they’re taller than.
The unspoken first step here was to assume a Gaussian bell curve when it comes to employee performance. And, heck, why not? The folks in HR who were paying attention during psychology class recall that a great many things in this world follow a Gaussian distribution, including traits that have predictive power when it comes to workplace performance. IQ is Gaussian. The Big Five Personality Trait known as Conscientiousness is likewise Gaussian. For what it’s worth, human height is also Gaussian, and that’s correlated with workplace success. And it turns out that many combinations of Gaussian variables (sums, convolutions…) are also Gaussian.
Giving it the data scientist “smell test” though, a few things jump out. First, are we really paying corporate salaries to people on the low end of the bell curve? The median individual income in the US is less than $50k, so my instinct is that those in corporate jobs are likely drawn from the top half of any bell curve, not the full bell curve. Also: the symmetry of the distribution doesn’t line up with my lived experience. I’ve seen many employees create delightful amounts of positive impact (sometimes millions of dollars’ worth), but I don’t recall their impact being offset by an equal number of employees at the other end of the performance spectrum, whose impact was of the same millions-of-dollars magnitude but directionally… un-stellar.
A Different Perspective: Pareto
As we think about the impact of an employee, measured in dollars, let’s draw a connection that I haven’t seen made anywhere else. Economists will teach you something called the Marginal Productivity Theory of Wages, the idea being that the amount of money that a company is willing to spend on an employee is essentially the value that the company expects to get out of their work. This strikes me as mostly true, most of the time, and likely to be the case in the corporate world that we’re considering here.
And then on a different day, in a different part of the microeconomics textbook, there’s often a histogram of the distribution of wages: the Pareto distribution, a mathematically interesting power-law. Let’s put these two ideas from Economics together. In any given salary band, employers across the United States are paying folks wages that collectively form a Pareto distribution:
And thus we arrive at my central idea: this is also the distribution of *performance* that we should expect from employees. Performance Management would do well to start with a Pareto assumption instead of a Gaussian assumption.
Like a good data scientist, I’m sweeping the mathematical details into the endnotes [0] so as to not befuddle the decision-makers, but fear not, the fun statistical stuff that I can’t shut up about is there in spades. Fun preview: the fact that corporate job postings these days include salary min/max lets us get an estimate of the size of a salary band, which was perhaps the piece of data that earlier thinkers who pondered such things were missing.
A Pareto distribution also passes the “sniff test” with respect to success and failure. I sometimes think that everything at a Fortune 500 company is Pareto distributed at every scale: my productivity this week, my productivity this year, the productivity of my team/project compared to others, and the success of my company compared to others. A few are quite successful, most are less successful. Supporting that notion with data is a task for another day, but let’s just say that I’ve already started the spreadsheet. In any case, let’s move forward to interpreting what we’ve discovered.
Implications
So what can be learned from realizing that performance follows this Pareto distribution?
Oh, here’s the percentile plot:
There are a few striking things here that have practical implications that inform how we might change the performance management process.
First, the Gaussian perspective holds that there is a bottom ~10% that performs egregiously poorly, and that this is an intrinsic part of the distribution. Corporations are in a constant battle to remove this portion from their payroll annually. On the other hand, looking at the Pareto percentile plot, the bottom 10% aren’t really all that different from the folks in the next 10%. As a matter of fact, there’s not an obvious place to draw a line to identify the “lowest end” employees to expunge. ~65% of employees are performing below the expectations that are associated with the salary midpoint (the green dashed line)!
But there are low-performing employees at large corporations; we’ve all seen them. My perspective is that they’re hiring errors. Yes, hiring errors should be addressed, but it’s not clear that any specific percentage of the workforce is the result of hiring errors. We also know that some managers are great at hiring, and have a very low error rate – it’s a bit of a tragedy to force them to cut employees who are performing reasonably well.
What percentage of employees should be given extra financial rewards? Unfortunately, the Pareto plot doesn’t reveal an obvious cutpoint in this regard either. The top 35% are performing above what’s expected at the salary midpoint, so that’s a starting point. Many companies take a more granular approach than GE’s original formula, i.e. they have “good - better - best” categories for bonuses. That seems reasonable.
Regarding the very top performers, the performance management process often identifies employees who are candidates for promotion. Much like hiring errors that aren’t reflected by the Pareto distribution, I think that promotion candidates should similarly be seen as outliers. Reward them handsomely (but within bounds) since their performance is, in this case, literally off the charts, and be honest in recognizing that they’re currently being underpaid despite being at the top of the pay band. This is a small market inefficiency that the promotion will soon remedy.
Second, let’s consider what the Pareto percentile plot reveals about the process of firing and backfilling, or more broadly, about hiring in general. We can see from the performance-vs-frequency plot that in any given salary band, low performers are roughly 3x as common as high performers! Wow. This aligns with my experience, and it speaks to the difficulty of hiring, and to the limited payoff of firing the bottom 10% in the hope of hiring people who are remarkably better.
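If you’d like to check those two numbers (the ~65% below the midpoint, and the 3x) right now rather than wading through endnote [0], here’s a minimal closed-form sketch using the endnote’s parameters: a Pareto with alpha = 1.75 confined to an illustrative pay band of 78.8 to 121.2 with midpoint 100. Reading “3x as common” as a ratio of the densities at the two ends of the band is my interpretation.

```python
# Quick check of the two figures quoted above, using the endnote [0] parameters.
# Assumes performance within a pay band follows a Pareto (alpha = 1.75)
# truncated to the band [78.8, 121.2], centered on a midpoint of 100.
alpha, band_min, mid, band_max = 1.75, 78.8, 100.0, 121.2

# Pareto CDF with scale = band minimum: F(x) = 1 - (band_min / x) ** alpha
F = lambda x: 1 - (band_min / x) ** alpha

# Share of the band's population performing below the salary midpoint:
below_mid = F(mid) / F(band_max)
print(f"below midpoint: {below_mid:.1%}")        # ~64%, i.e. roughly 65%

# Ratio of the Pareto density at the bottom of the band to the density at the
# top, one way to read "low performers are 3x as common as high performers":
density_ratio = (band_max / band_min) ** (alpha + 1)
print(f"density ratio:  {density_ratio:.1f}x")   # ~3.3x
```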
Summarizing:
There is no intrinsic bottom 10% that needs to be expunged annually. Let managers identify any hiring errors if they think their team can do better, but don’t set a target for this number based on faulty statistical notions.
Hiring in general is difficult because low performers are 3x as common as high performers. The Gaussian-inspired idea that you’ll probably get someone better if you fire the low end is inaccurate – the most likely replacement might not be remarkably better.
Things that are missing
If someone were designing a performance management system from scratch, and they let data scientists into the room, the data scientists would likely note the following issues. These issues will all require thought, and a design that is specific to a given company (which is another way of saying that there are no easy answers that can be implemented tomorrow by everyone).
Monitoring
Any statistically driven system that results in decisions that cost lots of money should be monitored to make sure that it’s achieving its goals, and that it continues to work as time goes on. How do we know it’s working? What are our actual goals? There might need to be measures in place that let us know if the company is inaccurately categorizing people. Or if managers are hiring sacrificial lambs to protect their valued team. Or if there really is differentiation in performance between those who are being fired and those who are being kept.
Decision Theory style cost analysis
The performance management process itself requires quite a bit of time and effort. Firing 10% of the company comes with costs (severance pay). Replacing those people is also costly (a helpful order of magnitude estimate is that the hiring process all told costs the company approximately a year’s salary). That new person will statistically be with the company for, say, three to five years. Is the increase in performance greater than the cost of the entire process? Are there opportunity costs, such that the company is better off with a less intense performance management system?
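To make the flavor of that cost analysis concrete, here is a minimal back-of-the-envelope sketch. Only the replacement-cost figure (~a year’s salary) and the tenure range (three to five years) come from the paragraph above; the salary and severance numbers are placeholder assumptions chosen purely for illustration.

```python
# Back-of-the-envelope churn cost sketch. Replacement cost (~1 year's salary)
# and tenure (3-5 years) come from the text; everything else is an assumption.
salary = 100_000                    # assumed annual salary for the role
replacement_cost = 1.0 * salary     # hiring, all told, ~ a year's salary
severance = 0.25 * salary           # assumed severance package
tenure_years = 4                    # assumed tenure of the replacement

churn_cost = replacement_cost + severance
# Annual performance uplift the replacement must deliver, over and above the
# person they replaced, just to pay back the cost of the churn:
break_even_uplift = churn_cost / tenure_years
print(f"break-even uplift: ${break_even_uplift:,.0f}/yr "
      f"({break_even_uplift / salary:.0%} of salary)")
```

Under these made-up numbers, the replacement has to out-produce the person they replaced by roughly a third of a salary every year just to break even, before counting the time managers spend on the process itself.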
A Time Series perspective
Performance management is a snapshot in time – one year. If an employee’s performance has been exemplary for three years in their current role and then they have a down year, should they be shown the door? Would a longer-term perspective on the employee’s suitability be a wiser approach?
Am I measuring what I think I’m measuring?
Data scientists worry about this a lot. (Well, the good ones do!) Is one employee's success rate lower because they’re given all of the most difficult challenges? Now you’re measuring the way that projects are assigned. Are extroverts consistently given better ratings despite equal performance? Now you’re measuring personality traits and managerial bias.
Things like this can be accounted for and monitored, to some extent, when designing the assessment process, but there might be trickier things afoot. It’s a manager’s job to set their employees up for success, and managers are highly unlikely to admit that they failed to do so, especially during performance management season, when they themselves are being graded.
It’s my opinion that the biggest factor in an employee's performance – perhaps bigger than the employee’s abilities and level of effort – is whether their manager set them up for success. It’s not easy for employees to change teams to get a better manager (perhaps it should be easier!). Maybe the rate at which employees jump ship these days and get jobs at other companies is an indicator that managers aren’t giving their employees the chance to succeed.
Similarly, it’s my observation that managers who are new to a company tend to do less-than-optimally their first time through the performance management process, especially when it comes to securing preferred performance slots for the people who work under them. The inevitable gamesmanship and political horse trading inherent in the process will go down differently at each company, and the first time through the process can be a harsh learning experience. What’s being measured might be your manager’s persuasiveness in the performance management sessions, and their political capital, as much as it is a measure of your performance.
Conclusions and Outlook
Performance management, as practiced in many large corporations in 2024, is an outdated technology that is due for an overhaul. Abandoning the Gaussian assumption in favor of a Pareto distribution reveals that there is no statistical basis for firing the bottom 10% of workers annually. Companies would be better off treating hiring mistakes as outliers, and should think hard about whether firing 10% of their workforce every year is worth the cost. Given the expense of performance management, monitoring to check whether its goals are actually being achieved, and a cost analysis to determine whether it pays for itself, would both be wise.
I’ll also mention here that the conditions that birthed the modern performance management system (GE in the early 1980s) included an assumption of lifetime employment at a company, as long as major mistakes weren’t made. It was likely valuable, in those days, to implement some sort of carrots-and-sticks program. Forty years later, employees jump ship regularly. Perhaps withholding carrots (i.e. giving an insultingly small bonus for the lowest performers) might be enough of a stick, and healthier for all involved. Figuring all this out will take more work.
Bonus Content #1: Is there any data to support this Pareto idea?
Executive summary: yep, there’s data from the early 20th Century that’s supportive, but seems to have been forgotten. The literature basically agrees that assessing an employee’s performance is quite subjective; identifying objective numbers isn’t an easy task, but what’s out there is consistent with the Pareto view and not the Gaussian view.
Attempting to be brief so as not to lose the reader entirely:
In 1957, William Shockley (Nobel Prize winner and noted jerkwad) looked at scientific publications and patents from the early 1950s at Los Alamos, Brookhaven, and Bell Labs. His conclusion – and the data are compelling in my view – was that performance is definitely not Gaussian. His take, based on a pet theory, was that the distribution is Log-Normal (a distribution that can look very similar to a Pareto). In 1970, Clarence Zener (yes, he of the Zener diode) revisited Shockley’s data along with earlier data from a couple of other publications, including “a large sample of 6891 authors, everyone, in fact, listed in Chemical Abstracts during 1907-1916 whose names started with either A or B”, and put a lot of ink into the Log-Normal vs. Pareto question, since the answer is obviously not Gaussian or anything even close to it.
More recently, Herman Aguinis has looked at data from sports, entertainment, politics, and, amusingly enough, Materials Science journal publications, which makes me a part of his dataset (!). I don’t think he’s managed to put his finger on a slam-dunk dataset, but his data is directionally correct. My take on his work is that it tends to center on total lifetime output, which speaks more to the length of one’s career than to annual output, but his instincts are right.
There are a few other references in the endnotes for the curious, including a link to an AI-assisted lit search from undermind.ai.
References are all in endnote [1].
Bonus Content #2: Regarding the practice of “rolling up”
Executive summary: omg, we’re unjustly penalizing ~5% of our employees and we didn’t know it. Measured against objective standards, roughly 5% of employees would land in a higher performance tier than the one the “forced distribution” assigns them; by rank-ordering and chopping, we’re departing from objective standards.
Is there anything else that a Data Science perspective can elucidate regarding typical performance management practices? Ah yes! Let’s say that managers are sitting in a meeting evaluating a batch of 27 employees, aiming to identify the bottom 10%. That would be… 2.7 people. Someone says, “Hey, there’s no way that we can hit these percentages exactly. And what if this batch has no underperformers?”, and the response is often something like “the percentages will work out when all of the batches across the larger organization are rolled up together. So we rank-order the employees that are borderline cases, and the final cutoff lines will be drawn when everything is combined.” Everyone in the meeting nods at the idea that the Vitality Curve will emerge and all will be fair when there are enough employees batched up together. It occurs to me that this is testable with a little simulation, so I’ll believe it when I see some data! Perhaps a minimum cohort size might emerge where “rolling up” causes very little collateral damage?
Let’s simulate the performance of a cohort of Gaussian-distributed employees, and further let’s pretend that managers are miraculously perfect at evaluating employee performance, so a list of best-to-worst employees is generated in exactly correct order. (For the data scientists: AUROC = Mann-Whitney U-Test = 1). The question before us is thus: does the “rolling up” process introduce any inaccuracies? More specifically, since we secretly know every employee’s exact performance, are they correctly categorized by the process of rank ordering and drawing lines at specific percentiles?
Here is a plot of the percent mis-categorized unjustly vs. cohort size, with 32 simulated annual performance reviews for each cohort size. The Welch GE cutoffs were used (bottom 10%, top 20%). I’ve only counted the number of employees who would have been in a higher category if objective standards were used instead of rank-and-chop; the numbers basically double if we decide to also count employees who are (beneficially) bumped up a category instead of down. My takeaway here is that there’s a very good chance of mis-categorizing 2.5% of a cohort no matter the cohort size, and 5% happens commonly even if the cohort size is a couple hundred.
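For those who want to poke at this themselves, a minimal sketch of this kind of simulation looks something like the following. It’s a simplified reconstruction under the stated assumptions (standard-normal “true” performance, perfect ranking, GE-style 10/70/20 cuts), not the full notebook behind the plot.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Objective cutoffs: the population 10th and 80th percentiles of a
# standard-normal "true performance" (GE-style 10/70/20 split).
FIRE_CUT, BONUS_CUT = norm.ppf(0.10), norm.ppf(0.80)

def tiers(scores, fire_cut, bonus_cut):
    """0 = bottom tier, 1 = middle tier, 2 = top tier."""
    return np.where(scores < fire_cut, 0, np.where(scores >= bonus_cut, 2, 1))

def pct_unjustly_downgraded(cohort_size, n_reviews=32):
    """Percent of employees whose rank-and-chop tier is lower than the tier
    their absolute performance would earn under fixed objective cutoffs."""
    downgraded = 0
    for _ in range(n_reviews):
        perf = rng.normal(size=cohort_size)       # true performance, perfectly observed
        objective = tiers(perf, FIRE_CUT, BONUS_CUT)
        # Rank-and-chop: force the bottom 10% / top 20% within this cohort.
        sample_fire, sample_bonus = np.quantile(perf, [0.10, 0.80])
        forced = tiers(perf, sample_fire, sample_bonus)
        downgraded += np.sum(forced < objective)
    return 100 * downgraded / (cohort_size * n_reviews)

for n in [27, 54, 100, 200, 400, 800]:
    print(f"cohort of {n:>3}: {pct_unjustly_downgraded(n):4.1f}% placed a tier too low")
```

The exact percentages depend on how the objective cutoffs are defined, but the qualitative point survives: forcing sample percentiles onto a finite cohort shoves a nontrivial slice of people into a lower tier than their absolute performance warrants.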
Endnotes
[0] Mathy Details!
Size of a pay band: when posting corporate job openings, there’s a marvelous new practice of displaying a salary band (min, max). A casual browse suggests that the minimum salary is generally around 0.65x the maximum for any given pay band. This lets us visualize the Pareto distribution within a pay band, which I haven’t seen done anywhere in the literature yet. For those looking to play along at home: a pay band of 78.8 → 121.2 is centered on 100 and has min/max = 0.65.
Pareto alpha = 1.75: this was drawn from a fit to individual salary data in the US (thank you, IPUMS, for the data; details below), and it can be cross-checked using the Gini Coefficient as follows: G = 1/(2a - 1), where G = 0.4 is commonly cited for US individual income (not household income!). Note that this one parameter, along with the size of the pay band, is all we need in order to specify the distribution! https://en.wikipedia.org/wiki/Pareto_distribution is a great starting place for those who wish to ponder the equations.
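As a quick cross-check (this is just the formula above inverted, nothing more):

```python
# Invert G = 1 / (2a - 1), which holds for an (untruncated) Pareto distribution.
G = 0.4                     # commonly cited Gini for US individual income
alpha = (1 + 1 / G) / 2
print(alpha)                # -> 1.75, matching the fitted value above
```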
Gaussian sigma: the outer edges of the pay band were assumed to sit at 3 sigma, but using e.g. 2.5 sigma gives similar results. Toss the outliers beyond the band, which is 0.27% of the simulated data in the 3 sigma case, or 1.24% in the 2.5 sigma case. (Easily justified: HR has rules about staying within the pay band, and is perfectly happy to underpay or overpay folks a little at the edges.) A sigma of 7.06 (3 sigma scenario) or 8.48 (2.5 sigma scenario) fits the 100 +/- 21.2 pay band.
Centering at 100 “performance units”: these plots are centered at 100 not so much to represent a salary of $100k, but because humans are fond of thinking of things on a scale of 0-100. A fun mathematical aspect of Pareto distributions is that they’re scale-invariant / self-similar, i.e., the dropoff from $100k to $90k is the same as the dropoff from $100 million to $90 million. So the choice of 100 is arbitrary, and the calculations work for any value you might choose; the only quantity that matters is the width of the pay band as a percentage of, say, the minimum value. Similarly, for the Gaussian, the choice of median is arbitrary; the only thing that matters is sigma / xbar.
Plots were generated with n = 25,000 simulated employees.
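For those playing along at home, here is a minimal sketch of how the two simulated populations can be generated from the parameters listed above (this is a reconstruction from those parameters, not the actual notebook):

```python
import numpy as np

# Reconstruct the two simulated populations from the parameters above:
# a Pareto (alpha = 1.75) confined to the pay band [78.8, 121.2], and a
# Gaussian centered on 100 with the band edges at 3 sigma.
rng = np.random.default_rng(42)
n = 25_000
alpha, band_min, band_max = 1.75, 78.8, 121.2

# Band-truncated Pareto via inverse-CDF sampling:
# F(x) = 1 - (band_min / x)**alpha, rescaled so all mass lies inside the band.
u = rng.uniform(size=n)
f_band = 1 - (band_min / band_max) ** alpha
pareto = band_min / (1 - u * f_band) ** (1 / alpha)

# Gaussian comparison population; the ~0.27% of samples that land outside
# the band are tossed, per the note above.
gauss = rng.normal(100, 21.2 / 3, size=n)
gauss = gauss[(gauss >= band_min) & (gauss <= band_max)]

# A peek at the two populations (10th, 50th, 90th percentiles):
print(np.percentile(pareto, [10, 50, 90]).round(1))
print(np.percentile(gauss, [10, 50, 90]).round(1))
```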
Perhaps I’ll clean up the Jupyter notebook and put it on GitHub at some point.
[1] Lit search
A few takeaways:
My reading of the more modern literature is that they’ve basically forgotten the Zener & Shockley papers, which I find to have the most compelling data. Regarding Pareto vs. Log-Normal for employee performance (and for individual income): to me, there’s not much practical difference. (I do have my own pet theory about the origin of the Pareto (or LogNormal) in these circumstances, which involves neural networks and hopefully ends with a Renormalization Group universality class, but that’s a bigger project for a later time.) Back to the literature: there are a number of other papers and articles not listed here that champion the idea of “it’s not Gaussian”, but they’re a bit short on listing concrete suggestions on what one should do once that realization sinks in.
Clarence Zener, “Statistical Theories of Success,” Proceedings of the National Academy of Sciences, Vol. 66, No. 1, pp. 18-24, May 1970.
William Shockley, “On the Statistics of Individual Variations of Productivity in Research Laboratories,” Proceedings of the IRE, p. 279, 1957.
Richard J. Chambers II, “Evaluating indicators of job performance: Distributions and types of analyses,” doctoral dissertation, College of Education, Louisiana Tech University, November 2016.
Ernest O’Boyle Jr. and Herman Aguinis, “The Best and the Rest: Revisiting the Norm of Normality of Individual Performance,” Personnel Psychology, 65, pp. 79-119, 2012.
A. Drăgulescu and V.M. Yakovenko, “Evidence for the exponential distribution of income in the USA,” Eur. Phys. J. B, 20, pp. 585-589, 2001.
A. Christian Silva and Victor M. Yakovenko, “Temporal evolution of the thermal and superthermal income classes in the USA during 1983-2001,” Europhys. Lett., 69 (2), pp. 304-310, 2005.
Additionally, I did an AI-assisted literature search to make sure I wasn’t missing anything big and important, which you can access here:
The data that I used to fit the Pareto to individual income (not household income!) came from the variables
INCWAGE (Wage and salary income)
ASECWT (Annual Social and Economic Supplement Weight)
from
Sarah Flood, Miriam King, Renae Rodgers, Steven Ruggles, J. Robert Warren, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Megan Schouweiler, and Michael Westberry. IPUMS CPS: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D030.V11.0
Comments
Very interesting article (linked to from Byrne Hobart's newsletter); I subscribed post haste and hope to see more posts from you in the future. Incidentally, re log-normal vs. power-law: for the Patreon "creator economy" service, the data show that both the (monthly) earnings per project and the number of patrons per project appear to follow a log-normal distribution. See for example https://rpubs.com/frankhecker/993611 and https://rpubs.com/frankhecker/994383 (Patreon is a more extreme example of "pay for performance" than any corporate environment, with a Gini coefficient of 0.84.)
I wanted to agree with this because we loved Pareto distributions but we used to say 80% of the work was done by 20% of the team. I think though you're kind of saying 80% of the team is essentially one level of productivity/contribution and then 20% are either hiring mistakes or promotion mistakes.
There's no way around three aspects of performance management:
• At some "n" (~100) there is a normal distribution of contribution. It might cluster into 3 or 5 groups, but the distribution is there. Just as in freshman calculus at MIT, scores cluster this way even though everyone got 1600 on their SATs and a 4.0 GPA at admission. The biggest failure in implementing this distribution is using too small a set of people (lazy management that avoids the process of finding groups) OR mixing job functions and/or expectations to make the group. The latter means not comparing similar levels of experience, or combining job functions that make it impossible to compare output.
• There is a budget for dollars/stock and it is fixed which means no matter what you have to create some distribution and enforce it. You can always just give everyone a gold star and randomly assign amounts of money but that is always perceived as political. You can give everyone the same amount of money but that is a different kind of politics. If you presume too much similarity in people then the 20% doing all the work won't get meaningful rewards.
• Part of a finite budget is that you can't promote everyone all the time; said another way, you can't promote every individual through the entire system. There's not enough salary budget or bonus budget. I like what you say about a short-term performance "error". As with giving everyone a gold star, you can also just separate titles/levels/rank from compensation, and then everyone will be a VP like in a bank, but people will find other ways to know who is senior.