I have an idea that would keep 100 percent of foreign-born terrorists out of the United States. Not only that, it’s far simpler than any presidential candidate’s proposals. All we have to do is this: Never let anybody in. Most of us find this idea ludicrous, of course, and rightly so. Keeping out terrorists is not the only goal of border policy; it’s essential that the vast majority of people can come and go freely, whether for pleasure, business, or survival. Yet many of our decisions are based on similarly shoddy reasoning: We often fail to consider that there are two sides to the accuracy seesaw.
For example, we celebrate mammograms for detecting 84 percent of breast cancers and bemoan it when law enforcement can’t access a phone, but we overlook how often mammograms flag cancers that aren’t there, or how many hackers were foiled by encryption. Even with this realization, you might still be tempted to trust a test when you know how often it’s correct overall—like an athletic doping test that’s right 85 percent of the time. This “percentage accuracy” metric does reflect how the test does on both clean and dirty athletes. Unfortunately, that’s still not enough: If the 15 percent of cases the test gets wrong are precisely the athletes who are doping, it’s not a terribly helpful test. Only by directly weighing both sides of the balance—the “precision/recall tradeoff,” as it’s known in computer science—can we make good decisions, whether in national policy, medicine, or our personal lives.
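The doping example is easy to verify with a little arithmetic. The sketch below uses made-up numbers (100 athletes, 15 of them doping) and a deliberately useless test that calls everyone clean—exactly the pathological case described above:

```python
# Hypothetical scenario: 100 athletes, 15 of whom are actually doping.
# A "test" that simply declares everyone clean is right 85% of the time,
# yet it catches zero dopers.
athletes = [True] * 15 + [False] * 85   # True = actually doping
predictions = [False] * 100             # the test always says "clean"

correct = sum(p == a for p, a in zip(predictions, athletes))
accuracy = correct / len(athletes)      # 0.85 — sounds impressive

caught = sum(p and a for p, a in zip(predictions, athletes))
recall = caught / sum(athletes)         # 0.0 — every doper walks free

print(f"accuracy: {accuracy:.0%}, recall: {recall:.0%}")
```

High percentage accuracy, in other words, can coexist with total failure at the one job the test exists to do.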
The terminology of recall and precision comes from the world of language processing, where software often has to classify inputs as relevant or irrelevant. For example, say you’re building the iPhone system that notices when a user says “Hey Siri.” To make your users happy, it’s not enough to be right most of the time about whether the user just hailed the phone (that is, to have high percentage accuracy). After all, most of the time Siri is not being paged; if all you cared about was the percentage of your guesses that are correct, you could do well by almost always ignoring the user, since guessing “no, that wasn’t a Siri request” is correct almost every time. Clearly, that would not be ideal.
Instead, the system needs to balance two related, more specific metrics of performance. On the one hand, you want to maximize the recall—the percentage of times users say “Hey Siri” that are registered. On the other hand, you also care about precision: You want to minimize the number of times the iPhone thinks they said it when they actually didn’t.
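These two definitions can be stated in a few lines of code. The tallies below are invented for illustration—suppose a detector registered 90 real requests, missed 10, and woke up 30 times to background chatter:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: of the times the phone woke up, what fraction were real requests?
    Recall: of the real requests, what fraction woke the phone?"""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Illustrative (made-up) tallies for a "Hey Siri" detector:
p, r = precision_recall(true_positives=90, false_positives=30, false_negatives=10)
print(f"precision: {p:.0%}, recall: {r:.0%}")  # precision: 75%, recall: 90%
```

Note that percentage accuracy appears nowhere here: each metric isolates one kind of mistake, which is exactly what makes the pair more informative than a single overall score.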
These two metrics are inherently in tension. If you detect “Hey Siri”s on a hair trigger, you’ll have very high recall but terrible precision. But if you refuse to register a “Hey Siri” unless you’re certain, you’ll have great precision but miserable recall. Even if you manage to score highly on both metrics, as long as your system is imperfect, it’ll still miss some requests and answer some spurious ones. To know how well you’re doing, and accordingly to build the best system, you need to explicitly decide how much you care about each type of accuracy.
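One way to see the tension concretely: imagine the detector assigns each sound a confidence score, and a threshold decides when to respond. All the scores below are invented, but sweeping the threshold shows the seesaw in action:

```python
# Invented confidence scores from a hypothetical "Hey Siri" detector.
wake_phrases = [0.9, 0.8, 0.7, 0.55, 0.4]   # real "Hey Siri" utterances
other_sounds = [0.6, 0.3, 0.2, 0.1]         # background chatter

def score(threshold):
    tp = sum(s >= threshold for s in wake_phrases)   # requests registered
    fp = sum(s >= threshold for s in other_sounds)   # spurious wake-ups
    fn = len(wake_phrases) - tp                      # requests missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.2, 0.5, 0.8):
    p, r = score(t)
    print(f"threshold {t}: precision {p:.0%}, recall {r:.0%}")
```

A hair-trigger threshold of 0.2 catches every request but answers lots of noise; a cautious 0.8 never answers noise but misses most requests. Moving the threshold can only trade one error for the other.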
Of course, your decision will depend on what you’re trying to do. For Siri, neither beeping unnecessarily nor missing a salutation is the end of the world, so recall and precision are of roughly equal priority. But for a credit card company trying to detect fraudulent transactions, letting fraud go unnoticed is far worse than calling you for extra confirmation. They’ll want to heavily favor recall.
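One standard way to encode such a preference numerically (not mentioned in this essay, but common in machine learning) is the F-beta score, which averages precision and recall while weighting recall β times as heavily:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision,
    beta = 1 treats them equally (the familiar F1 score)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A fraud detector might pick beta = 2 to penalize missed fraud heavily.
# Swapping the same two numbers shows the asymmetry:
print(f_beta(precision=0.7, recall=0.9, beta=2))  # higher score
print(f_beta(precision=0.9, recall=0.7, beta=2))  # lower: recall dropped
```

With β = 2, a system that sacrifices some precision to boost recall scores better than its mirror image, which is exactly the behavior a credit card company would want to reward.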
Talking to phones may seem like a far cry from securing borders, but the two are subject to exactly the same reasoning. Much like Siri’s designers, a Homeland Security official deciding who should be kept out needs to trade off catching as many bad actors as possible against minimizing the bycatch. Yet all too often, the policy arguments we hear pretend that only one of the two metrics matters: “But there will be poor, innocent refugees who will be excluded!” (Precision.) “But there could be a terrorist who gets in!” (Recall.) A sensible discussion must acknowledge both concerns, and move from there to how we ought to weigh them against each other.
The ramifications of the precision/recall tradeoff extend far beyond immigration policy. In fact, the same logic can be applied to any attempt to classify something as good or bad where we’re somewhat uncertain. When we set the criteria and vetting procedures for welfare, what fraction of spurious recipients are we willing to tolerate in exchange for helping what fraction of the needy? How certain should a soldier be that a building contains an enemy combatant, rather than a civilian, before ordering a military strike? Under what circumstances should the FDA approve a drug with less safety testing, potentially helping desperate patients but also risking major harm? Each decision boils down to a compromise between catching all the bad things and the risk of collateral damage.
The tradeoff even pops up in life situations that we wouldn’t normally think of as binary decisions. One of the most creative people I know recently confided to me that he’s had ideas he’s excited about, but he’s been reluctant to share them because it would take so much effort to verify that what he was saying was accurate. While I’ll be the first to champion the need for factual accuracy (especially in the current political climate), I had to wonder: how many spectacular ideas was he robbing the world of to avoid that occasional false positive? Was achieving precision really worth the hit in recall?
I’ve wondered the same about my own reluctance to join political movements that I agree with, for fear that I’ll regret it later—that some apparently beneficial policies they endorse will turn out to be bad ideas. Perhaps I’m sacrificing positive impact on the altar of precision, refusing to endorse even the ideas that are probably good. My fear of being proven wrong implicitly undervalues recall.
While precision versus recall is not the only kind of tradeoff, it’s an extremely common one, and it’s no exaggeration to say that the concept has changed the way I view the world. With so much rhetoric focusing (deliberately or not) only on one side of the seesaw, recalling this tradeoff is an excellent way to avoid being lied to with statistics. It’s an essential conceptual tool in the belt of any decision-maker, public or private—so much so that I’d say it’s the most important thing I’ve learned in grad school.
Jesse Dunietz, a Ph.D. student in computer science at Carnegie Mellon University, has written for Motherboard and Scientific American Guest Blogs, among others. Follow him on Twitter @jdunietz.
The lead photograph is courtesy of Vimal Kumar via Flickr.