Revealing Healthcare Networks Using Insurance Claims Data


As I noted in my post last week, every accountable care organization in the United States is trying to understand provider networks. Common questions include:

  • What is the “leakage” from our network?
  • What medical practices should we acquire?
  • What are the referral patterns of providers within the network?
  • Does the path that a patient takes through our network of care affect outcomes?
  • Where should we build the next outpatient clinic?

Much of this analysis is done using insurance claims data, and this post is about how such data is turned into a provider network analysis.  Here, I’ll discuss how billing or referral data is turned into graphs of provider networks.  Most of us are now familiar with social networks, which describe how a group of people are “connected”.  A common example is Facebook, where apps like TouchGraph show who you are friends with, whether your friends are friends with each other, and so on.  These networks are built on a simple concept: the relationship.

To describe a physician network, we first make a table from claims data that shows which physicians (D) billed for visits or procedures on which patients (P).  This is shown in the figure below.  Next, we tally which physicians billed for seeing the same patient, and how many times, giving a common billing matrix.  The billing does not have to happen at the same visit or for the same problem, just over the course of the measurement period.  Notice that the matrix is symmetrical, with the diagonal giving the total number of patient encounters for each doctor.  This type of matrix is referred to as a distance or similarity matrix.
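As a concrete sketch, the common billing matrix can be computed directly from a claims table.  The doctors, patients, and counts below are made up for illustration (and, for simplicity, the diagonal here counts unique patients rather than total encounters):

```python
from collections import defaultdict

# Toy claims table: (doctor, patient) billing pairs extracted from claims data.
claims = [
    ("D1", "P1"), ("D1", "P2"), ("D1", "P3"),
    ("D2", "P2"), ("D2", "P3"),
    ("D3", "P3"), ("D3", "P4"),
    ("D4", "P4"),
]

# Patients seen by each doctor over the measurement period.
patients_of = defaultdict(set)
for doctor, patient in claims:
    patients_of[doctor].add(patient)

# Symmetric common billing matrix: off-diagonal cells count shared patients;
# the diagonal is each doctor's own patient count.
doctors = sorted(patients_of)
matrix = {
    (a, b): len(patients_of[a] & patients_of[b])
    for a in doctors for b in doctors
}

print(matrix[("D1", "D2")])  # → 2 (D1 and D2 share patients P2 and P3)
```

The same pair-counting idea scales to millions of claims; only the bookkeeping gets harder.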


The provider network graph plotted from the above example shows the network relationship between four doctors.  The size of each circle shows the total number of patients billed for by that doctor, and the width of each line shows the strength of the shared-patient connection.


Now, if we have this data for a large network, we can look at a number of measures using standard methods.  In the above example, we can see that the two orange providers are probably members of a group practice, sharing many of the same patients and referring to many of the same providers.  (See this humorous post by Kieran Healy identifying Paul Revere as the ringleader of the American Revolution using a similar analysis!)  Providers in red are “out-of-network”, with connections to a single in-network physician.  However, the graph itself does not reveal why these out-of-network providers share patients with the in-network provider.  It could be that the out-of-network group offers a service not available within the network, such as gastric bypass, pediatric hepatology, or kidney transplantation.

It is not difficult to see that you could create network representations using many types of data.  Referral data would allow you to add directionality to the network graph.  You could also look at total charges for shared patients, as opposed to visits or procedures, to get a sense of the financial connectedness of providers or practices.  Linking by lab tests or procedures can show common practice patterns.  Many other variations are possible.  Network complexity grows with the number of providers and patients in your claims data.
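A directed, charge-weighted variant might be sketched like this (the referral records and dollar amounts are invented for illustration):

```python
# Directed referral edges capture who refers to whom; weighting by total
# charges, rather than visit counts, shows financial connectedness.
referral_charges = [
    ("D1", "D2", 1200.0),  # D1 referred a patient whose D2 care billed $1200
    ("D1", "D2", 800.0),
    ("D2", "D3", 450.0),
]

# Aggregate into a directed, charge-weighted edge map.
edges = {}
for src, dst, charge in referral_charges:
    edges[(src, dst)] = edges.get((src, dst), 0.0) + charge

print(edges[("D1", "D2")])  # → 2000.0
```

Unlike the symmetric shared-patient matrix, (D1, D2) and (D2, D1) are now distinct edges.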

These simple graphs are just the beginning.  Couple the network graph with the geospatial locations of providers, and you add another layer of complexity.  Add city bus routes, and you can see how patients might get to your next office location.  Add census data, and you can look at the relationship between medical practice density, referral patterns, and the average income within a zip code area.  The possibilities are incredible!

So why is this big data?  To build a large and accurate network, you need to analyze millions of insurance claims, lab tests, or other connection data.  Analyzing data of this size requires large amounts of computer memory and, often, cluster computing and distributed software such as Hadoop (more on this in a future post).  We owe a very large debt to the “Healthcare Hacker” Fred Trotter, who created the first such open-source, very large network graph from 2011 Medicare claims data for the entire United States, called DocGraph.  The dataset can be downloaded from NotOnly Dev for $1 here.  This graph has 49 million connections between almost a million providers.  Ryan Weald created a beautiful visualization of the entire DocGraph dataset, which I will leave you with here.


Healthcare Data Privacy and Self-Insured Employers


In the rush to control healthcare costs, many employers are self-insuring.  As part of this move, most self-insured networks have become intensely interested in analyzing their own claims and medication cost data.  This type of analysis can be highly informative.  For example, Fred Trotter has created an enormous Medicare referral network graph (DocGraph) for all physicians and providers in the United States.  Essentially, he took Medicare claims data and counted the number of instances in which two physicians billed for care on the same patients.  Physicians were identified by their unique National Provider Identifier (NPI) number, which is publicly available here.  By some very simple matrix manipulation on this very large data set of 2011 Medicare claims, he created DocGraph.  The resulting data is very simple: {provider #1, provider #2, number of instances where provider #1 billed for seeing patients that provider #2 also saw at some point}, but it is very large (49 million relationships).  This graph can be used to identify referral “cliques” (who refers to whom), and other patterns.  The bottom line is that any organization with claims data, big data storage and processing capabilities, and some very simple analytics can do this.  Similar analyses can be done for medication prescribing patterns, disability claim numbers, and other care-delivery metrics.

Now, this can be a good thing from a business standpoint.  For example, to contain costs, you want most of your patients treated by providers in your network, where you have negotiated contracts.  Out-of-network treatments are termed “leakage” by the industry.  Network “leakage” analysis can rapidly identify which physicians are referring out-of-network and how often.  Assuming that equivalent services are available in-network (and this is the key question), you could make these physicians aware of the in-network resources and craft a referral process that makes it easier for them and their patients to access care.
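At its core, a leakage analysis is just a filter over the referral edge list.  A minimal sketch, with invented providers and counts:

```python
# Toy referral counts: (referring physician, referred-to provider) -> referrals.
# In-network membership would come from contract data.
referrals = {
    ("D1", "D2"): 30, ("D1", "X9"): 12,
    ("D2", "X7"): 4,  ("D3", "D1"): 15,
}
in_network = {"D1", "D2", "D3"}

# Leakage per referring physician: referrals sent outside the network.
leakage = {}
for (src, dst), n in referrals.items():
    if dst not in in_network:
        leakage[src] = leakage.get(src, 0) + n

print(leakage)  # → {'D1': 12, 'D2': 4}
```

The hard part in practice is not this tally but the follow-up question: whether an equivalent in-network service actually exists for each out-of-network referral.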

You can also identify physicians who are the “hubs” of your network: practitioners who are widely connected to others through patient care.  These may be the movers-and-shakers of care standards, and the group that you want to involve in the development of new patient care strategies.  For a great example, see this innovative social network analysis of physicians in Italy and their attitudes towards evidence-based medicine.
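One simple way to find such hubs, sketched here with made-up shared-patient counts, is to rank providers by weighted degree, i.e. the total shared patients across all of their connections:

```python
from collections import Counter

# Symmetric shared-patient edge list: (doctor A, doctor B, shared patients).
edges = [("D1", "D2", 12), ("D1", "D3", 9), ("D1", "D4", 7), ("D2", "D3", 2)]

# Weighted degree: total shared patients per doctor; high values mark hubs.
degree = Counter()
for a, b, w in edges:
    degree[a] += w
    degree[b] += w

hub, _ = degree.most_common(1)[0]
print(hub)  # → 'D1' (weighted degree 28)
```

Fancier centrality measures (betweenness, eigenvector centrality) refine the same idea, but weighted degree is often a reasonable first cut.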

These types of analyses are not without problems and could be used unwisely.  For example, physicians who prescribe expensive, non-generic medications may be highly informed specialists.  Programs that do not take such information into account may unfairly penalize network providers.  In addition, some services may not be available in-network, so providers referring out of network in these cases are actually providing the best care for their patients.  Finally, these analytics could easily be used to identify “high utilizers” of healthcare services, and to better manage their healthcare.  Network analytics are really good at such pattern recognition.  As we move forward, a balanced approach to such analytics is needed, especially to prevent premature conclusions from being drawn from the data.

There is a larger issue also lurking beneath the surface: employee discrimination based on healthcare data.  Some healthcare networks are triple agents: healthcare provider, employer, and insurer.  It may be tempting from a business standpoint to use complex analytics to hire or promote employees based on a combined analysis of performance, healthcare, and other data.  Google already uses such “people analytics” for hiring.  Some businesses may try to use such profiling, including internal healthcare claims data, to shape their workforce.  Even if individual health data is not used by a company, it seems likely that businesses will use de-identified healthcare data to develop HR management systems.  See Don Peck’s article in the Atlantic for some interesting reading on “people management” systems.

As a last thought, it’s a bit ironic that we, as a healthcare system in the United States, will be spending hundreds of millions of dollars analyzing whether our patients are going “out-of-network” for care, and designing strategies to keep them in network, when this problem does not exist for single-payer national healthcare systems…

Primary Care Genomics: The Next Clinical Wave?

Is the main barrier in healthcare analyzing and connecting the massive amounts of data present in electronic medical records, or is it generating the right data at the right level?  To really move healthcare forward, argue Michael Groner, VP of engineering and chief architect, and Trevor Heritage, we need to move research-level testing (whole exome sequencing, genomics, clinical proteomics) outside of the research environment and make it widely available to primary care physicians.  According to Groner, only when we amass large collections of such data will the true value of big data analytics methods be realized in medicine.

“It’s untenable to expect every physician or health care provider interested in improving patient care through the use of genomics testing to make the costly capital and other investments required to make this science a practical reality that impacts day-to-day patient care. Instead, the aim should be to connect the siloed capabilities associated with genomics testing into a simple, physician-friendly workflow that makes the best services accessible to every provider, regardless of geography or institutional size or affiliation…The true barrier to clinical adoption of genomic medicine isn’t data volume or scale, but how to empower physicians from a logistical and clinical genomics knowledge standpoint, while proving the fundamental efficacy of genomics medicine in terms of improved patient diagnosis, treatment regimens, outcomes and improved patient management.”

It’s a great dream, and parts of it will be realized in the future, but it ignores many of the realities of in-the-trenches medical practice and medical science.  Genomic medicine simply will not improve diagnostic acumen for many clinical problems; it’s just the wrong method.  Some examples include fractures, appendicitis, stroke, heart attacks, and many others.  Sequencing my genome will not diagnose my diverticulitis.  This has nothing to do with making genomic science and whole genome analytics a practical reality, but rather with matching the tools to the appropriate medical problem and scale.  Genomics is quite good at providing information about genetic risk of conditions, but not necessarily at diagnosing them.  Knowing that somebody has a BRCA1 breast cancer gene mutation does not tell you whether they actually have breast cancer, and if they do, which breast it is in, whether it has metastasized, and where.

Groner’s larger point about the need to use data science to make personalized medicine a real-time reality, however, is well taken.  For example, the new guidelines for treatment of cholesterol abnormalities with statins, powerful cholesterol-lowering drugs, are based on a risk score that no provider can calculate in their head.  Personalized medicine could evolve to generate a personalized risk assessment, based on a risk score for cardiovascular disease.  Beyond this, one could imagine the risk score being modified by a proteomics analysis of subtle serum proteins and their associated contributions to cardiovascular risk, and by a genomic analysis of hereditary risk.  Integrating this evidence and providing clinicians with some measure of how to weight the predicted risk factors when making treatment decisions are true growth areas for medical genomics and health informatics.

Norovirus, Networks, and Big-Data

Another norovirus outbreak has been in the news, related to a group of cases on a cruise ship.  With over 700 passengers and crew falling ill, it is one of the largest outbreaks on a cruise ship ever reported.  Norovirus is a highly contagious member of the Caliciviridae family, with multiple genotypes and subtypes.  Small mutations in the norovirus genome lead to new strains, similar to the phenomenon of antigenic drift in influenza viruses.  Larger mutations can lead to pandemic strains when the prevailing population immunity to older strains is no longer effective against the new strain.  The United States is in the midst of the norovirus season, with a new strain being responsible for most cases.

How is Big Data Science revolutionizing the tracking and prediction of norovirus outbreaks?  The US Centers for Disease Control and Prevention now tracks norovirus outbreaks through a combination of traditional outbreak surveillance, as reported by public health departments around the US and confirmed by molecular testing of specimens from symptomatic individuals.  But an alternative Big Data real-time social media monitoring approach is being tested in the UK by the Food Standards Agency (FSA).  Tweet the hashtag #Barf in London, and your tweet will be added to the FSA statistics, along with its geographic location.  About 50% of gastrointestinal illnesses in the US and UK are caused by norovirus, so tweets and Google searches about stomach cramps, vomiting, and diarrhea have a high likelihood of being norovirus related!  FSA researchers found that an upswing in hashtags describing GI symptoms occurred 3-4 weeks before an outbreak was identified by traditional laboratory surveillance.
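A toy version of this kind of syndromic signal detection might look like the following, where weekly hashtag counts are compared against a trailing baseline.  The counts and threshold are invented, not the FSA’s actual method:

```python
# Toy weekly counts of GI-symptom hashtags (e.g. #barf) in one region.
weekly_counts = [10, 12, 9, 11, 10, 13, 30, 45]

def spike_weeks(counts, baseline=4, factor=2.0):
    """Flag week indices whose count exceeds factor x the trailing baseline mean."""
    flagged = []
    for i in range(baseline, len(counts)):
        mean = sum(counts[i - baseline:i]) / baseline
        if counts[i] > factor * mean:
            flagged.append(i)
    return flagged

print(spike_weeks(weekly_counts))  # → [6, 7]
```

Real systems add geographic clustering and noise filtering, but the core signal is just this: a sustained jump above the recent baseline, weeks before laboratory confirmation.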

#Vomit:  Predicting Norovirus Outbreaks with Twitter

So how can Big Data Science contribute to solutions?  Recognizing outbreaks in real time using Big Data analytics is a start.  Taming data velocity and volume are key here.  Early recognition can lead to containment, and public health strategies can limit the outbreak.  But potential solutions go beyond larger public health responses.  One of the major ways individuals can prevent the spread of the virus, and protect themselves from being infected, is simple good hygiene such as hand washing.  Norovirus outbreaks occur more frequently in places where people are living together and have risk factors such as being elderly, immunosuppressed, or very young.  Day care centers, nursing homes, and hospitals are the key areas.  In a novel application of Big Data Science real-time analytics, IBM has developed a method of tracking handwashing among healthcare workers after each patient contact.  An RFID tag carried by the worker, coupled with sensors that record entry into the room, exit, and use of a hand sanitizer dispenser, has led to pronounced increases in hand-washing.  The jury is still out on whether this will reduce infectious outbreaks or their spread, but if the promise bears out, look for such systems in high-risk areas such as institutional kitchens, day care centers, and other settings.  It does seem a bit Big Brother-ish, which is a topic for my next post…

For now….wash your hands, tweet your symptoms, and stay healthy!

How Much Unstructured Big Medical Data Is There In The EMR?

How much unstructured big data is there in the EMR?  Unstructured data is data that doesn’t fit into neat columns on a spreadsheet, or into fields and look-up tables in a database, like the narrative text in an HPI.  It used to be that we sat down with a pen and the paper chart, and wrote our progress notes in the office and in the clinic.  Or we dictated the notes, which were transcribed.  But with the advent of the EMR, templates have crept in, as well as the widespread and controversial practice of copying and pasting text from a previous encounter (see the recent NYT article).

This is interesting in a quirky way.  As physicians, nurse practitioners, and other providers have become reluctant data entry clerks, they use many shortcuts so that they will have time to take care of their patients, including templates with stylized or constrained vocabularies, self-generated “smart phrases”, and patient-specific narratives that can be recalled and modified.  The remainder of the note is populated with structured data already in the system (labs, test results, x-ray results).  Because medical changes are often not so dramatic from one day to the next, the actual novel unstructured information content from one note to the next may be only a tiny fraction of the total bytes, and the change between the current and previous note probably carries as much information as the note’s actual content.  But when people get hurried or sloppy, old information gets carried along that is no longer current but has not been updated in the notes.  So the key information extraction task is identifying the true changes, separating them from relatively static or outdated data that is carried along, and extracting the novel information.
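One crude way to isolate the novel content is to diff consecutive notes, sketched here with Python’s standard difflib.  The notes are invented, and real clinical text would need much more robust handling:

```python
import difflib

# Two consecutive progress notes; most text is copied forward,
# and only a small fraction is truly new.
note_day1 = "Patient stable. Lungs clear. Continue current antibiotics."
note_day2 = "Patient stable. Lungs clear. New rash on left arm. Continue current antibiotics."

# Collect only the text segments that were inserted or changed in the new note.
matcher = difflib.SequenceMatcher(None, note_day1, note_day2)
novel = "".join(
    note_day2[j1:j2]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag in ("insert", "replace")
)

print(novel)  # the inserted fragment mentions the new rash, nothing else
```

Here the unchanged template text drops out and only the day’s new finding survives, which is exactly the separation the information-extraction problem calls for.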

How is this relevant to big data analytics in medicine?  If much of the content is captured by a stylized vocabulary, and filled with structured data already present in data tables, how much independent information will there be in a medical note?  If the data has dependencies because of this stylized nature and these controlled vocabularies, how does that impact data mining and statistical analytics?  I am not sure if this type of problem has a formal technical term in machine learning, but if not, it is likely to get one soon!