Revealing Healthcare Networks Using Insurance Claims Data


As I noted in my post last week, every healthcare accountable care organization in the United States is trying to understand provider networks. Common questions include:

  • What is the “leakage” from our network?
  • What medical practices should we acquire?
  • What are the referral patterns of providers within the network?
  • Does the path that a patient takes through our network of care affect outcomes?
  • Where should we build the next outpatient clinic?

Much of this analysis is being done by using insurance claims data, and this post is about how such data is turned into a provider network analysis.  Here, I’ll discuss how billing or referrals data is turned into graphs of provider networks.  Most of us are now familiar with social networks, which describe how a group of people are “connected”.  A common example is Facebook, where apps like TouchGraph that show who you are friends with, and whether your friends are friends, and so on.  These networks are build with a simple concept, that of a relationship.

To describe a physician network, we first make a table from claims data that shows which physicians (D) billed for visits or procedures on which patients (P).  This is shown in the figure below.  Next, we tally which physicians billed for the seeing the same patient, and how many times, giving a common billing matrix.  The billing does not have to happen at the same visit or for the same problem, just over the course of the measurement period. Notice that the matrix is symmetrical, with the diagonal giving the total number of patient encounters for each doctor.  This type of matrix is referred to as a distance or similarity matrix.


The provider network graph plotted from the above example shows the network relationship between four doctors.  The size of the circle shows total number of patients billed for by that doctor, and the width of the line shows the strength of the shared patient connection.


Now, if we have this data for a large network, we can look at a number of measures using standard methods.  In the above example, we can see that the two orange providers are probably members of a group practice, sharing many of the same patients and referring to many of the same providers. See this humorous post by Kieran Healy identifying Paul Revere as the ringleader of the American Revolution using a similar analysis!  Providers in red are “out-of-network”, and with connections to a single in-network physician.  However, the graph itself does not reveal the reason that these out-of-network providers share patients with the in-network provider.   It could be that the out-of-network group offers a service not available within the network, such as gastric bypass, pediatric hepatology, or kidney transplantation.

It is not difficult to see that you could create network representations using many types of data.  Referral data would allow you to add directionality to the network graph.  You could also look at total charges in shared patients, as opposed to visits or procedures, to get a sense of the financial connectedness of providers or practices.  Linking by lab tests or procedures can show common practice patterns.  Many other variations are possible. Complexity of the network can increase with the more providers and patients in the claims data you have.

These simple graphs are just the beginning.  Couple to network graph with geospatial locations of providers, and you add another layer of complexity.  Add city bus routes, and you can see how patients might get to your next office location.  Add census data, and you can look at the relationship between medical practice density, referral patterns, and the average income within a zip code area.  The possibilities are incredible!

So why is this big data?  To build a large and accurate network, you  need to analyze millions of insurance claims, lab tests, or other connection data.  Analyzing data of this size requires large amounts of computer memory and, often cluster computers, and distributed computing software such as Hadoop (more on this in a future post).  We owe a very large debt to the “Healthcare Hacker” Fred Trotter, who created the first such open source, very large, network graph from 2011 Medicare claims data for the entire United States, called DocGraph. The dataset can be downloaded from NotOnly Dev for $1 here.  This graph has 49 million connections between almost a million providers.  Ryan Weald created a beautiful visualization of the entire DocGraph dataset, which I will leave you with here.


Big Data and the Flu

Flu is one area in medical science where Big Data has made inroads.  Every year, the seasonal influenza strains make their way across the world, infecting tens of millions of people, causing serious illness and even death. Every decade or so, a new stain of flu emerges which differs radically from the prior strains that we have been exposed to.  Because we lack immunologic memory to these pandemic strains, they are able to cause much more serious illness in large segments of the world population.  Thus, tracking the emergence of new influenza viral strains, population level infections, and studying immune responses to both infection and vaccination has generated very large amounts of data, much of which is now publicly available. Basic data is collected each year on the genetic make-up of circulating flu viruses to select the strains which will be included in the next year’s influenza vaccine.  This involves collecting flu virus specimens from all over the world, genetically sequencing them, clustering viruses by sequence similarity, and picking the emerging strains that differ enough to need a new vaccine.  The process culminates in February, when new vaccine strains are chosen by the World Health Organization and the Center for Disease Control. All of this activity has let to and explosion in the number of data sets and types available for public use.  Data sets for influenza research span the gamut of informatics.  At the basic science level, the Influenza Virus Resource contains an extensive database of influenza virus protein sequences.  The Influenza Research Database contains sequences and more, including immune epitope data (which sequence segments protective antibodies or cells recognize on the influenza virus).  These data sets allow scientists to determine how related viruses are to each other by sequence comparison.  A novel dimensional reduction method, which makes use of multi-dimensional scaling (MDS) methods, termed “antigenic cartography” can be found here.  Multidimensional scaling allows complex relationships between influenza virus sequences and vaccine immunity to be reduced to a distance or dissimilarity measure and plotted in two dimensions.  This visualization method allows researchers to show how related different strains of flu are in the immune response they generate.  Other groups, such as my laboratory, have performed detailed time-series experiments and collected data on individual immune responses, including measurements of 24,000 different genes each day for 11 days after vaccination in multiple subjects.  The raw RNAseq gene expression data for this experiment takes up approximately 1.7 terabytes.

Google Flu Trends predictions of influenza activity by web search term aggregation

At the other end of the spectrum are near-real time data on influenza cases across the United States tracked by the Center for Disease Control, in Europe, across the world, tracked by the World Health Organization. A particularly relevant Big Data App is the Google Flu Trends site, which uses big data aggregation methods to tally Google searches for influenza related terms by  geographic location.  Google search activity increases during the seasonal influenza outbreaks, and parallels data from the CDC of confirmed cases of influenza or influenza-like illnesses.  A great example of the “Four V’s” of Big Data in use:  Volume, Velocity, Variety and Veracity. One of my colleagues, Henry Kautz at the University of Rochester, and his graduate student Adam Sadilik (now @Google) have taken this a step further, estimating your odds of getting influenza or other illness by real-time analysis of GIS-linked Twitter feeds!  A demonstration can be found on their web site GermTracker.  They use microblog GIS information coupled with AI methods to analyze key word content linked with illness to determine the likelihood that you are sick, or that you have come into contact with somebody who is sick.  They then predict your odds of coming down with the flu in the next 8 days based on your location. What has been done for flu in terms of public big data can be done for other infectious diseases at multiple levels, and we will likely see an increasing trend towards open source data, crowd sourced predictive analytics, and real-time big data analysis.