Ashish Jha et al. published a paper in this month’s JAMA comparing the US healthcare system to other countries. It’s a fairly robust effort comparing several system features, so the paper made its rounds through medical Twitter, although I wouldn’t say it’s any different from the Commonwealth Fund reports that get published in Health Affairs every year and from which the Jha paper draws heavily (more on this later). I’m not sure how familiar the medical community is with the Health Affairs literature, since that’s a journal designed for administrative types who are the mortal nemesis of clinicians, so there may be some utility in publishing this in JAMA. The paper’s basic argument is that the US does not appear to be meaningfully different from the mean of a basket of comparable OECD countries on measures of utilization or social spending, and that the cause of high healthcare spending in the US is therefore prices. This is an analysis we have seen before, notably the 2003 paper by Uwe Reinhardt and colleagues that reached the same conclusion. Although I do not disagree with the Jha conclusion that US prices for goods and services are elevated and need reform, the proposition that the US isn’t structurally different from other systems is incorrect and needs to be put into context.
When I started my degree in health policy, it immediately struck me how the US healthcare system does not function as a free market. Without prices set by competitive interaction between patients and providers, the system suffers from the chronic information problem Hayek highlighted in The Fatal Conceit: innovations get stuck in the hospital where they were developed and diffuse through slow channels like academic journals and conferences rather than through much faster market pressures. Even if a large system like Kaiser Permanente is doing something very unique and effective, the rest of the healthcare system may never find out about it because of the lack of signaling mechanisms. My intuition was therefore to focus on comparative healthcare research and to look for international perspectives on how a healthcare system could be organized and coordinated. Perhaps we can identify best practices by figuring out what some countries are doing drastically differently to achieve drastically different results?

The good news is that there is indeed a rich area to explore, especially in the developing world, where healthcare systems can’t afford to waste money like we do (my quick look into Iran’s community health worker system is just one example). The bad news is that comparative healthcare research is a very new area, and the research methodology and data are often questionable. Although comparisons of the US healthcare system to other countries have been made many times in the past, such as this talk from the 1980s by Milton Friedman at the Mayo Clinic in which he compares us to the NHS, the first major thrust for comparative healthcare system research came from the WHO with their 2000 health system report, in which they attempted to rank 191 countries in a variety of areas, ultimately nominating France as the best healthcare system. Although this pleased the French, the report was deeply criticized by academics and politicians.
Data Problems
The major problem with the WHO report, and with much of the comparative healthcare literature in general, is that the data is unreliable. The report provided an 80% uncertainty interval for each country, which obviously leaves a lot of room for error that would be unacceptable in other healthcare literature. These intervals made it difficult to state with any real certainty that one system was better than another and allowed for large potential swings, with the US ranking anywhere between 7th and 24th.

In order to improve the data available and expand our understanding, the Commonwealth Fund began compiling comparative reports on the performance of several English-speaking countries, eventually expanding to include several other Western European and Nordic countries. The Jha paper relies on the Commonwealth Fund surveys for perception and access metrics. Does the Commonwealth data achieve the goal of providing high quality data? I respect the Commonwealth Fund tremendously and appreciate the difficulty of their task, but having worked with their survey I don’t believe that this goal is achieved for the US. The Commonwealth surveys have sample sizes of 1,000-2,000 per country, with the exception of Canada, which generally has samples of 4,000-5,000 because the Canadian Ministry of Health contributed funds to expand their survey sample. A sample size of 1,000 may be sufficient for small countries with homogeneous systems like the UK, Switzerland or Sweden, but a basic sanity check tells us that it is inadequate for competently describing a diverse federal system like the United States, especially when you consider that New York is not Alaska and the 2,000 responses are spread out evenly across the 50 states. Forty respondents don’t tell me much about healthcare in New York, especially considering how the differences between NYC and upstate NY are like night and day.

In addition to the Commonwealth surveys, the Jha paper relies on the 2016 OECD Health panel for utilization and spending data. I haven’t personally worked in depth with the OECD data, but my general attitude can be summarized by a gaffe from a former Ukrainian state economist: “the Ministry hasn’t published the report yet because we haven’t decided yet what we want the numbers to say.” The Jha paper itself admits that there are problems with the OECD data: “When OECD data were not available for a given country or more accurate country-level estimates were available, country-specific data sources were used.” I submit that it’s irresponsible to make authoritative claims like “utilization rates in the United States were largely similar to those in other nations” when the data is compiled from a diverse set of sources of clearly varying reliability.
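To put rough numbers on that sanity check, here’s a back-of-the-envelope sketch (my own illustration, not anything from the paper) of the 95% margin of error for a survey proportion, assuming simple random sampling and the worst-case 50% response split:

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """95% margin of error for a proportion estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# A 2,000-person national sample vs. the ~40 respondents left per state
# if the sample is spread evenly across 50 states.
print(f"national (n=2000): +/- {margin_of_error(2000):.1%}")  # ~ +/- 2.2%
print(f"per state (n=40):  +/- {margin_of_error(40):.1%}")    # ~ +/- 15.5%
```

A fifteen-point margin of error swamps most of the cross-country differences these surveys are trying to detect.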
Analysis Problems
Jha et al suggest that the US system is unremarkable because US utilization of various services is similar to the average utilization of the eleven countries analyzed. For example, the US discharge rate for mental and behavioral conditions is 679 per 100,000 population, so the US is deemed unremarkable because it’s close to the average discharge rate of 736. This analytical strategy is fundamentally flawed for two reasons.
The first is that the average behavior of a basket of very different systems effectively has no meaning, because you cannot infer anything about the behavior of any single system from its relationship to the average. Take the mental health discharges example: if I wanted to design an average healthcare system from scratch, the average of 736 per 100,000 population tells me nothing about how much staff I should hire, what facilities I should construct or which patients I should admit. Furthermore, it would be incorrect to describe the US protocols as “typical” because these systems are clearly very different. The mental health discharge rate for the Netherlands is 119 per 100,000 population, while the rate for Germany is 1,719 per 100,000. These are not systems doing the same thing and marginally varying around a mean; they differ by a factor of 14! The US can’t be a “typical” system because a “typical” system clearly can’t exist with such large variability. Either the Germans or the Dutch are doing something very wrong, because I highly doubt that they’re both right. And if they are somehow both right, then the entire concept of the average becomes irrelevant, because countries deliver care that is appropriate for their population, and any nation’s position with respect to the average says nothing about whether its care suits its own patient population.

Another example of this problem is hospital discharges per 1,000, where Japan and the US have similar rates of roughly 125 discharges per 1,000. Can we safely conclude that Japan and the US have similar admission and discharge practices? That would be an impossible conclusion if you knew that Japanese hospitals are simply closed on evenings and weekends. These two systems will clearly behave very differently as a result of this hospital schedule, even though they may appear similar when you only look at them on a graph.
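Here’s a toy illustration of why the mean carries no design information. The Dutch (119) and German (1,719) rates are the figures quoted above; the other numbers are made up by me so that both baskets share the same mean:

```python
import statistics

# Hypothetical basket A: systems tightly clustered around the mean.
basket_a = [730, 735, 740, 739]
# Basket B: the quoted Dutch (119) and German (1719) rates plus two
# made-up fillers chosen so both baskets have the same mean.
basket_b = [119, 1719, 400, 706]

for name, basket in [("A", basket_a), ("B", basket_b)]:
    print(f"basket {name}: mean={statistics.mean(basket):.0f}, "
          f"stdev={statistics.stdev(basket):.0f}")
# Both means are 736, but only basket A describes anything like a
# "typical" system; basket B's mean matches no actual member.
```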
The second problem with using the average of eleven countries as a benchmark for US performance is that the average can be manipulated to serve your needs by adding or removing countries. Why does Jha use these specific countries? They state that it’s to use the wealthiest countries, but that’s a dubious proposition because the US is MUCH wealthier than the rest of the world. If Germany were a US state it would be the 40th poorest state in the union; France would be the 48th poorest. Are these really comparable? We therefore have to admit that we’re taking some liberties in how strict we are with our selection, and if we’re going to do that, then maybe the analysis should be expanded to the full 34 countries of the OECD? Alternatively, maybe we should be stricter with the selection criteria. Culture obviously has a large impact on population health, and Japan is one included country that’s clearly very different from the rest, as reflected by Japan being the lowest on just about every utilization metric. But if we’re removing Japan, are the Germanic countries really that similar to the US? Perhaps it would be best to focus on the four English-speaking countries, which will be the most similar culturally? But doing that changes all the averages. It shifts COPD discharges per 100,000 population from 206 to 252, and suddenly the US utilization rate of 230 goes from the middle of the pack to the lowest utilizer! The point is that there is no such thing as objective data; all data has an opinion, and any comparative analysis of many countries is vulnerable to manipulation by adding or removing countries until you make the US look the way you want it to look. Using an average as a benchmark does not fix this critical flaw.
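The mechanics are trivial, which is exactly the problem. A quick sketch using only the figures quoted above:

```python
# COPD discharges per 100,000, using only the figures quoted above.
us_rate = 230
benchmarks = {
    "full eleven-country basket": 206,
    "English-speaking countries only": 252,
}

for label, mean in benchmarks.items():
    verdict = "above" if us_rate > mean else "below"
    print(f"{label}: US is {verdict} the average ({us_rate} vs {mean})")
# Same US data point, opposite conclusion, purely from basket selection.
```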
Knowledge Problems
Finally, a major challenge I encountered while working with comparative systems data is simply figuring out the truth of what I’m looking at and how the data maps onto reality. Nassim Taleb calls this the pseudo-expert problem: you may have a lot of knowledge about a topic but still have no idea what’s really going on, because you don’t have direct contact with the situation. In comparative systems research the problem comes from trying to reduce complex systems to simple metrics that can be analyzed and compared, which unfortunately strips out critical qualitative information.
For example, Jha puts forward data suggesting that social spending in the US is similar to other countries, but doesn’t note that this is only true after adding up public and private contributions to the programs that constitute “social spending”. That’s not exactly what most people would consider “social” spending, and after taking away private contributions the US quickly falls behind. Is it appropriate to combine private and public social spending because Americans are generally wealthier than Europeans and have more disposable income? I don’t know, but the issue is too complicated to be casually glossed over.

Another example is hospital beds per 1,000, which Jha reports as 2.8 in the US and 2.7 in Canada. These figures may lead a policy analyst to conclude that the two systems have similar levels of hospital bed supply and similar problems, but this would be incorrect. Canada has a chronic shortage of hospital beds that does not exist in the US because Canadian law requires patients to be discharged directly to appropriate post-acute care. However, there is a shortage of post-acute care facilities in Canada, resulting in as many as 13% of Canadian hospital beds at any time being occupied by patients who no longer need acute care and are simply waiting to be discharged, sometimes for months. These beds are effectively phantom capacity, resulting in blockages in other areas, such as EDs, where it is not uncommon for Canadians to wait over 24 hours just to be admitted.

A final example is primary care physician consultations, for which Jha reports 4 visits per person per year in the US compared to an average of 6.6. An analyst may even be impressed with Germany, which reports just over 10 visits per person per year. This analysis, however, makes the error of assuming that all consultations are created equal, and ignores the fact that American consultations average ~22 minutes compared to only ~8 minutes in Germany. To make matters worse, we don’t even know if the American consultations are of higher quality: the American doctor may be fiddling with the EHR for 20 minutes with only 2 minutes dedicated to the patient, while the German doctor may be able to dedicate the full 8 minutes to the patient. Baicker and Chandra raise similar concerns about intensity, quality and measurement in their response to Jha. We almost never know what the reality is!
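Two of those adjustments are easy to make explicit. A rough sketch using the figures quoted above (applying the 13% figure to Canada’s bed count is my own arithmetic, not a number from the paper):

```python
# Total primary care contact time, from the visit and duration figures above.
visits = {"United States": 4, "Germany": 10}    # visits per person per year
minutes = {"United States": 22, "Germany": 8}   # average visit length (min)

for country in visits:
    total = visits[country] * minutes[country]
    print(f"{country}: ~{total} min of consultation per person per year")
# US ~88 min vs Germany ~80 min: 2.5x the visits, similar total contact time.

# Effective acute bed supply in Canada if ~13% of beds hold patients
# awaiting post-acute placement.
canada_beds = 2.7                      # beds per 1,000 (reported)
effective = canada_beds * (1 - 0.13)   # beds per 1,000 actually usable
print(f"Canada effective acute beds: ~{effective:.2f} per 1,000")  # ~2.35
```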
Conclusion
This rant isn’t meant to discredit the work done by Jha et al, or the effort of the organizations responsible for the data they use in this paper. A lot of work obviously went into this report and it has definitely expanded people’s knowledge of the area, as evidenced by the fanfare it continues to receive on social media. I’m only seeking to express my frustration over the significant analytical errors that are present in most literature comparing healthcare systems. Healthcare is 18% of the economy and any errors have massive ramifications. My own view, to echo Hayek, is that these systems are far too complex to be properly analyzed en masse and instead it is best to focus on very specific topics ie rural nursing home care in Japan and United States. Anything more broad than that is almost certainly going to be wrong and I encourage researchers to hesitate before making firm statements like “it’s the prices stupid”. If we don’t respect the complexities of these systems, how could we possibly expect the New York Times to understand them?