The Big Medicare Payment Data Release

Today Medicare released payment data for over 880,000 healthcare providers, and include charge and payment information, provider specialties and addresses, billing codes, and other specific information.  The Medicare data set is downloadable here.  The description on the Medicare web site describes the data set as:

“Provider Utilization and Payment Data: Physician and Other Supplier Public Use File (Physician and Other Supplier PUF), with information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals.  The Physician and Other Supplier PUF contains information on utilization, payment (allowed amount and Medicare payment), and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service. This PUF is based on information from CMS’s National Claims History Standard Analytic Files. The data in the Physician and Other Supplier PUF covers calendar year 2012 and contains 100% final-action physician/supplier Part B non-institutional line items for the Medicare fee-for-service population.”

There are some notable caveats to making conclusions about the data, which have been extensively outlined by  Problems such as payer mix and specialty bias should be considered.  For example, pediatricians will have many fewer Medicare patients, while specialties with patients 65+ or special Medicare programs, such as Nephrology (Disclosure:  this my sub-specialty), may have a higher proportion of Medicare insured patients.

How will this large data set help us understand healthcare practices in the United States?  Several promising analyses come to mind:

  • Analysis of varying payment amounts for similar procedures – Because the same medical procedure can be billed on several different codes that account for the complexity of care provided, there is the opportunity for the “Lake Woebegone Effect” – where all the procedures have above average difficulty.  In some cases it might be true that a particular physician specializes in the most difficult cases (e.g. advanced chemotherapy using an implantable pump for liver cancer), but this is the exception rather than the rule.


  • Network analysis of unusual billing patterns -Here is where coupling this database with DocGraph (see my previous post here), a network graph database of all the referral patterns for Medicare for all US patients, may yield very interesting findings.  Some networks of physicians may have unusual billing patterns compared with others.  In some cases, this will be a sign of efficiency and great medical care delivery.  In others, it may be a sign of inefficiency or, in rare cases, something more ominous such as a pattern of fraud among a group or organization of providers.


  • Network analysis of procedure frequency – More useful, will be the ability to study types of procedures and visits among providers in different geographic areas, and the reimbursement variations.  Already, USA Today has posted a map of average reimbursement by state.  While some sophisticated analysis will be needed to reach thoughtful conclusions about regional variations in care, this will certainly spur a great deal of analysis and hopefully some good healthcare policy.


So, a good day for data transparency in healthcare delivery, and I say that as somebody whose Medicare practice is in the database! Let’s hope that high quality data analytics with thoughtful research follows.


Heathcare Data Privacy and Self-Insured Employers

Merge Data

In the rush to control healthcare costs, many employers are self-insuring.  As part of this move, most self-insured networks have become intensely interested in analyzing their own claims and medication cost data.  This type of analysis can be highly informative.  For example, Fred Trotter has created an enormous Medicare referral network graph (DocGraph) for all physicians and providers in the United States.  Essentially, he took Medicare claims data and counted the number of instances that two physicians billed for care on the same patients.  Physicians were identified by a unique National Practitioner Identifier (NPI) number, which is publicly available here.   By some very simple matrix manipulation on this very large data set of 2011 Medicare claims, he created DocGraph. The resulting data is very simple:  {provider #1, provider #2, number instances where P#1 billed for seeing patients that p#2 also saw at some point}, but very large (49 million relationships).  This graph can be used to identify referral “cliques” (who refers to whom), and other patterns.  The bottom line is that any organization that has claims data, big data storage and processing capabilities, and some very simple analytics can do this.  Similar analyses can be done for medication prescribing patterns, disability claim numbers, and other care-delivery metrics.

Now, this can be a good thing from a business standpoint.  For example, to contain costs, you want most of your patients treated by providers in your network where you have negotiated contracts.  Out-of-network treatments are termed “leakage” by the industry. Network “leakage” analysis can rapidly identify which physicians are referring out-of-network and how often.   Assuming that the equivalent services are available in-network, and this is the key question, you could make these physicians aware of the resources and craft a referral process that makes it easier for them and their patients to access care.

You can also identify physicians who are the “hubs” of your network,  practitioners who are widely connected to others by patient care. These may be the movers-and-shakers of care standards, and the group that you  want to involve in development of new patient care strategies.  For a great example, see this innovative social network analysis of physicians in Italy and their attitudes towards evidence based medicine.

These types of analyses are not without problems and could be used unwisely.  For example, physicians who prescribe expensive, non-generic medications may be highly informed specialists.  Programs that do not take such information into account may unfairly penalize network providers.  In addition, some services may not be available in-network, so providers referring out of network in these cases are actually providing the best care for their patients.  Finally, these analytics could easily be used to identify “high utilizers” of healthcare services, and to better manage their healthcare.  Network analytics are really good at such pattern recognition.  As we move forward, a balanced approach to such analytics is needed, especially to prevent premature conclusions from being drawn from the data.

There is a larger issue also lurking beneath the surface:  employee discrimination based on healthcare data.  Some healthcare networks are triple agents:  healthcare provider, employer, and insurer.  It may be tempting from a business side to use complex analytics to hire or promote employees based on a combined analysis of performance, healthcare and other data.  Google already uses such “people analytics” for hiring.  Some businesses may try to use such profiling, including internal healthcare claims data, to shape their workforce.  Even if individual health data is not used by a company, it seems likely that businesses will use de-identified healthcare data to develop HR  management systems.  See Don Peck’s article in the Atlantic for some interesting reading on “people management” systems.

As a last thought, it’s a bit ironic that we, as a healthcare system in the United States, will be spending hundreds of millions of dollars analyzing whether our patients going “out-of-network” for care, and designing strategies to keep them in network, when this problem does not exist for single-payer National Healthcare Systems…

Big Data and the Flu

Flu is one area in medical science where Big Data has made inroads.  Every year, the seasonal influenza strains make their way across the world, infecting tens of millions of people, causing serious illness and even death. Every decade or so, a new stain of flu emerges which differs radically from the prior strains that we have been exposed to.  Because we lack immunologic memory to these pandemic strains, they are able to cause much more serious illness in large segments of the world population.  Thus, tracking the emergence of new influenza viral strains, population level infections, and studying immune responses to both infection and vaccination has generated very large amounts of data, much of which is now publicly available. Basic data is collected each year on the genetic make-up of circulating flu viruses to select the strains which will be included in the next year’s influenza vaccine.  This involves collecting flu virus specimens from all over the world, genetically sequencing them, clustering viruses by sequence similarity, and picking the emerging strains that differ enough to need a new vaccine.  The process culminates in February, when new vaccine strains are chosen by the World Health Organization and the Center for Disease Control. All of this activity has let to and explosion in the number of data sets and types available for public use.  Data sets for influenza research span the gamut of informatics.  At the basic science level, the Influenza Virus Resource contains an extensive database of influenza virus protein sequences.  The Influenza Research Database contains sequences and more, including immune epitope data (which sequence segments protective antibodies or cells recognize on the influenza virus).  These data sets allow scientists to determine how related viruses are to each other by sequence comparison.  A novel dimensional reduction method, which makes use of multi-dimensional scaling (MDS) methods, termed “antigenic cartography” can be found here.  Multidimensional scaling allows complex relationships between influenza virus sequences and vaccine immunity to be reduced to a distance or dissimilarity measure and plotted in two dimensions.  This visualization method allows researchers to show how related different strains of flu are in the immune response they generate.  Other groups, such as my laboratory, have performed detailed time-series experiments and collected data on individual immune responses, including measurements of 24,000 different genes each day for 11 days after vaccination in multiple subjects.  The raw RNAseq gene expression data for this experiment takes up approximately 1.7 terabytes.

Google Flu Trends predictions of influenza activity by web search term aggregation

At the other end of the spectrum are near-real time data on influenza cases across the United States tracked by the Center for Disease Control, in Europe, across the world, tracked by the World Health Organization. A particularly relevant Big Data App is the Google Flu Trends site, which uses big data aggregation methods to tally Google searches for influenza related terms by  geographic location.  Google search activity increases during the seasonal influenza outbreaks, and parallels data from the CDC of confirmed cases of influenza or influenza-like illnesses.  A great example of the “Four V’s” of Big Data in use:  Volume, Velocity, Variety and Veracity. One of my colleagues, Henry Kautz at the University of Rochester, and his graduate student Adam Sadilik (now @Google) have taken this a step further, estimating your odds of getting influenza or other illness by real-time analysis of GIS-linked Twitter feeds!  A demonstration can be found on their web site GermTracker.  They use microblog GIS information coupled with AI methods to analyze key word content linked with illness to determine the likelihood that you are sick, or that you have come into contact with somebody who is sick.  They then predict your odds of coming down with the flu in the next 8 days based on your location. What has been done for flu in terms of public big data can be done for other infectious diseases at multiple levels, and we will likely see an increasing trend towards open source data, crowd sourced predictive analytics, and real-time big data analysis.