Healthcare Data Privacy and Self-Insured Employers

In the rush to control healthcare costs, many employers are self-insuring.  As part of this move, most self-insured networks have become intensely interested in analyzing their own claims and medication cost data.  This type of analysis can be highly informative.  For example, Fred Trotter has created an enormous Medicare referral network graph (DocGraph) for all physicians and providers in the United States.  Essentially, he took Medicare claims data and counted the number of instances in which two physicians billed for care on the same patients.  Physicians were identified by a unique National Provider Identifier (NPI) number, which is publicly available here.  With some very simple matrix manipulation on this very large data set of 2011 Medicare claims, he created DocGraph.  The resulting data is very simple:  {provider #1, provider #2, number of instances where P#1 billed for seeing patients that P#2 also saw at some point}, but very large (49 million relationships).  This graph can be used to identify referral “cliques” (who refers to whom) and other patterns.  The bottom line is that any organization with claims data, big data storage and processing capabilities, and some very simple analytics can do this.  Similar analyses can be done for medication prescribing patterns, disability claim numbers, and other care-delivery metrics.
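To make the mechanics concrete, here is a minimal sketch in Python/pandas of the kind of shared-patient counting described above.  The file name and column names (claims.csv, patient_id, npi) are assumptions for illustration; this is not Trotter's actual pipeline, which ran over far larger Medicare extracts.

```python
import pandas as pd
from itertools import combinations

# Hypothetical claims extract: one row per (patient_id, npi) billing event.
claims = pd.read_csv("claims.csv", usecols=["patient_id", "npi"]).drop_duplicates()

# For each patient, count every pair of providers who billed for that patient.
pair_counts = {}
for _, group in claims.groupby("patient_id"):
    for p1, p2 in combinations(sorted(group["npi"]), 2):
        pair_counts[(p1, p2)] = pair_counts.get((p1, p2), 0) + 1

# The result mirrors the {provider #1, provider #2, shared-patient count} triples.
docgraph = pd.DataFrame(
    [(a, b, n) for (a, b), n in pair_counts.items()],
    columns=["npi_1", "npi_2", "shared_patients"],
)
print(docgraph.sort_values("shared_patients", ascending=False).head())
```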

Now, this can be a good thing from a business standpoint.  For example, to contain costs, you want most of your patients treated by providers in your network, where you have negotiated contracts.  Out-of-network treatments are termed “leakage” by the industry.  Network “leakage” analysis can rapidly identify which physicians are referring out-of-network and how often.  Assuming that equivalent services are available in-network (and this is the key question), you could make these physicians aware of those resources and craft a referral process that makes it easier for them and their patients to access care.
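At its core, a leakage report is just a grouped count.  The sketch below is a hedged illustration with hypothetical file names, columns, and an in-network roster; a real analysis would of course run inside a HIPAA-compliant environment.

```python
import pandas as pd

# Hypothetical referral records and an in-network provider roster.
referrals = pd.read_csv("referrals.csv")            # columns: referring_npi, receiving_npi
in_network = set(pd.read_csv("network_roster.csv")["npi"])

# Flag each referral as leakage when the receiving provider is out of network.
referrals["leakage"] = ~referrals["receiving_npi"].isin(in_network)

# Leakage rate per referring physician, highest first.
leakage_report = (
    referrals.groupby("referring_npi")["leakage"]
    .agg(total_referrals="count", leaked="sum")
    .assign(leakage_rate=lambda d: d["leaked"] / d["total_referrals"])
    .sort_values("leakage_rate", ascending=False)
)
print(leakage_report.head(10))
```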

You can also identify physicians who are the “hubs” of your network: practitioners who are widely connected to others through patient care.  These may be the movers-and-shakers of care standards, and the group you want to involve in developing new patient care strategies.  For a great example, see this innovative social network analysis of physicians in Italy and their attitudes towards evidence-based medicine.
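Finding such hubs is a standard network-centrality exercise.  A possible sketch using networkx, assuming an edge list in the {npi_1, npi_2, shared_patients} form described above (file name and ranking choice are illustrative):

```python
import pandas as pd
import networkx as nx

# Edge list in the {npi_1, npi_2, shared_patients} form sketched earlier.
edges = pd.read_csv("docgraph_edges.csv")

G = nx.Graph()
for row in edges.itertuples(index=False):
    G.add_edge(row.npi_1, row.npi_2, weight=row.shared_patients)

# Weighted degree ("strength") highlights the most widely connected providers.
strength = dict(G.degree(weight="weight"))
hubs = sorted(strength, key=strength.get, reverse=True)[:20]
print(hubs)
```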

These types of analyses are not without problems and could be used unwisely.  For example, physicians who prescribe expensive, non-generic medications may be highly informed specialists.  Programs that do not take such information into account may unfairly penalize network providers.  In addition, some services may not be available in-network, so providers referring out of network in these cases are actually providing the best care for their patients.  Finally, these analytics could easily be used to identify “high utilizers” of healthcare services, and to better manage their healthcare.  Network analytics are really good at such pattern recognition.  As we move forward, a balanced approach to such analytics is needed, especially to prevent premature conclusions from being drawn from the data.

There is a larger issue also lurking beneath the surface:  employee discrimination based on healthcare data.  Some healthcare networks are triple agents:  healthcare provider, employer, and insurer.  It may be tempting, from a business standpoint, to use complex analytics to hire or promote employees based on a combined analysis of performance, healthcare, and other data.  Google already uses such “people analytics” for hiring.  Some businesses may try to use such profiling, including internal healthcare claims data, to shape their workforce.  Even if individual health data is not used by a company, it seems likely that businesses will use de-identified healthcare data to develop HR management systems.  See Don Peck’s article in the Atlantic for some interesting reading on “people management” systems.

As a last thought, it’s a bit ironic that we, as a healthcare system in the United States, will be spending hundreds of millions of dollars analyzing whether our patients are going “out-of-network” for care, and designing strategies to keep them in network, when this problem does not exist for single-payer national healthcare systems…

Primary Care Genomics: The Next Clinical Wave?

Is the main barrier in healthcare analyzing and connecting the massive amounts of data already present in electronic medical records, or is it generating the right data at the right level?  To really move healthcare forward, argue Michael Groner, VP of engineering and chief architect, and Trevor Heritage, we need to move research-level testing (whole exome sequencing, genomics, clinical proteomics) outside of the research environment and make it widely available to primary care physicians.  According to Groner, only when we amass large collections of such data will the true value of big data analytics methods be realized in medicine.

“It’s untenable to expect every physician or health care provider interested in improving patient care through the use of genomics testing to make the costly capital and other investments required to make this science a practical reality that impacts day-to-day patient care. Instead, the aim should be to connect the siloed capabilities associated with genomics testing into a simple, physician-friendly workflow that makes the best services accessible to every provider, regardless of geography or institutional size or affiliation…The true barrier to clinical adoption of genomic medicine isn’t data volume or scale, but how to empower physicians from a logistical and clinical genomics knowledge standpoint, while proving the fundamental efficacy of genomics medicine in terms of improved patient diagnosis, treatment regimens, outcomes and improved patient management.”

It’s a great dream, and parts of it will be realized in the future, but it ignores many of the realities of in-the-trenches medical practice and medical science.  Genomic medicine will simply not improve diagnostic acumen for many clinical problems; it’s just the wrong method.  Some examples include fractures, appendicitis, stroke, heart attacks, and many others.  Sequencing my genome will not diagnose my diverticulitis.  This has nothing to do with making genomic science and whole genome analytics a practical reality, but rather with matching the tools to the appropriate medical problem and scale.  Genomics is quite good at providing information about genetic risk of conditions, but not necessarily at diagnosing them.  Knowing that somebody has a BRCA1 breast cancer gene mutation does not tell you whether they actually have breast cancer, and if they do, which breast it’s in, whether it has metastasized, and where.

Groner’s larger point about the need to use data science to make personalized medicine a real-time reality, however, is well taken.  For example, the new guidelines for treatment of cholesterol abnormalities with statins, powerful cholesterol-lowering drugs, are based on a risk score that no provider can calculate in their head.  Personalized medicine could evolve to generate a personalized risk assessment, based on a risk score for cardiovascular disease.  Beyond this, one could imagine the risk score being modified by a proteomics analysis of subtle serum proteins and their associated contributions to cardiovascular risk, and by a genomic analysis of hereditary risk.  Integrating this evidence, and providing clinicians with some measure of how to weight the predicted risk factors when making treatment decisions, are true growth areas for medical genomics and health informatics.
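As a toy illustration of what such an integrated score might look like computationally, the sketch below combines traditional risk factors in a logistic-style model and scales the result by a hypothetical hereditary-risk modifier.  The coefficients are placeholders, not the published pooled cohort equation values, and the genomic modifier is purely illustrative.

```python
import math

# Illustrative only: placeholder coefficients, NOT the published guideline values,
# which must come from the actual risk-equation tables.
COEFFS = {"age": 0.06, "total_chol": 0.008, "hdl": -0.02, "sbp": 0.01, "smoker": 0.5}
INTERCEPT = -7.0

def ten_year_risk(age, total_chol, hdl, sbp, smoker, genomic_modifier=1.0):
    """Logistic-style risk estimate, optionally scaled by a hereditary-risk factor."""
    x = INTERCEPT + (COEFFS["age"] * age + COEFFS["total_chol"] * total_chol
                     + COEFFS["hdl"] * hdl + COEFFS["sbp"] * sbp
                     + COEFFS["smoker"] * smoker)
    baseline = 1.0 / (1.0 + math.exp(-x))
    return min(baseline * genomic_modifier, 1.0)

print(f"{ten_year_risk(60, 210, 45, 140, smoker=1, genomic_modifier=1.3):.1%}")
```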

Geospatial Data and HIPAA

How have privacy regulations affected the use of GIS data?

Since 1854, when John Snow used geospatial mapping to locate the well spreading cholera in London, GIS data has been a cornerstone of public health and epidemiology research.  Today, a wealth of data sources are available for research.  For example, locate a patient within a census tract in the United States, and a variety of information such as average income in the area, demographic data, and other census information can be linked directly to your patient-specific study data.  Alternatively, in this innovative study from Brazil, GIS mapping software was used to determine that the distance an expectant mother had to travel through urban transportation networks to reach healthcare was an important risk factor for death during pregnancy.  Similar studies have used GIS data to examine infant mortality, HIV mortality in rural populations, and tuberculosis control measures.  While geocoding large amounts of data for medical epidemiology studies can be extremely informative, you need to be careful not to run afoul of government privacy laws, especially the HIPAA Privacy Rule in the United States.
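Linking patients to tract-level attributes is typically a spatial join.  Here is a hedged sketch using geopandas; the file names, the patient_id column, and the tract attribute names are assumptions (the tract polygons and attributes would come from census sources).

```python
import geopandas as gpd

# Hypothetical inputs: patient study records with point geometries, and census
# tract polygons carrying attributes such as median income.
patients = gpd.read_file("patients_geocoded.geojson")   # point geometries + patient_id
tracts = gpd.read_file("census_tracts.shp")             # polygon geometries + attributes

# Spatial join: attach the tract (and its attributes) containing each patient point.
patients = patients.to_crs(tracts.crs)
linked = gpd.sjoin(patients, tracts[["GEOID", "median_income", "geometry"]],
                   how="left", predicate="within")
print(linked[["patient_id", "GEOID", "median_income"]].head())
```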

The Health Insurance Portability and Accountability Act (HIPAA) rules define protected health information (PHI), which may include diagnoses, test results, and payment or visit information.  The intent was to protect people against disclosure of health information in conjunction with information that could reveal their identity.  This identifying information consists of 18 identifiers, such as name, social security number, and date of birth.  The definition of “identifiable information” also includes any data that would allow another person to re-identify an individual, directly or indirectly, without access to a specific code or key.  For geospatial information, the personal identifiers include a person’s street address and ZIP code.  GIS coordinates are considered an “equivalent geocode,” meaning that they are as good as a street address.  Imagine a map plotting the location of eight people infected with HIV in a sparsely populated rural area.  It would not take much to match that data up with a specific person.  The point is that all such information needs to be de-identified before it can be released or worked on outside of a HIPAA-compliant data storage and analysis environment.

De-identification of GIS data in healthcare research can be thought of as a two-part process:  de-identifying the data while obtaining the set of coordinates used to plot a person’s location (geocoding), and de-identifying the data when presenting the results of your research.

Geocoding is the process of translating an address into a set of XY coordinates that can be used to plot a location on a map.  You could do this easily by feeding a list of addresses into one of several geocoding services on the internet, such as bulkgeocoder, Google, MapQuest, CloudMade, or ArcGIS Online.  But if you have lists of patient data, this could be a massive HIPAA violation.  The best way to make sure you are HIPAA compliant is to use a geocoding firm with which you have a business associate agreement (BAA), and which will take your information and generate the geocodes in a HIPAA-compliant and secure environment.  An important best practice is to process a list of addresses that has been separated from any other information and can only be linked by a secure, randomized key.  Once the geocoding service returns your data, you can link it back to your complete research file.  It is unclear, however, whether submitting a list of addresses using an e-mail address containing information about a diagnosis (e.g. Researcher@DiabetesInstituteResearch.Org) outside of a BAA would constitute a breach, since one might infer the diagnosis of people at the listed addresses from the organization name.  Best to consult your organization’s privacy officer about this issue.
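The address-separation practice might look something like the sketch below: the addresses travel with only a random key, and the coordinates are merged back in-house once the BAA-covered service returns them.  File and column names are illustrative, and the geocoding step itself is whatever your contracted vendor provides.

```python
import secrets
import pandas as pd

study = pd.read_csv("study_data.csv")    # hypothetical file: PHI plus street addresses

# 1. Assign a random, non-derivable key to each record.
study["link_key"] = [secrets.token_hex(8) for _ in range(len(study))]

# 2. Only the address list (plus key) goes to the BAA-covered geocoding service.
address_file = study[["link_key", "address", "city", "state", "zip"]]
address_file.to_csv("addresses_for_geocoder.csv", index=False)

# 3. The service returns coordinates keyed only by link_key ...
geocoded = pd.read_csv("geocoder_results.csv")    # columns: link_key, lon, lat

# 4. ... which are merged back into the full research file in-house.
study = study.merge(geocoded, on="link_key", how="left")
```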

Once you have done your analysis and wish to publish plotted geocoded data, it must be done in a way that does not allow an individual to be identified from the data set alone or in combination with other publicly available data.  Think of the map of firearm owners in Westchester County published by a local newspaper.  If it had been a map of people with a diagnosis of leukemia, it would have been a HIPAA violation.  De-identification methods can be quite sophisticated, such as statistical de-identification.  An interesting workshop sponsored by the Department of Health and Human Services discussing these issues can be found here.  Several methods are available to avoid this pitfall:

  • Point aggregation – combining points into geographic bins, such as ZIP code areas, counties, states, or other areas.  This way, no individual data point is identifiable as a person, but the bins must have a sufficient population and subject density (a brief sketch of aggregation and jittering follows this list).
  • Geostatistical analysis – One example is creating a probability map, where any area represents the probability of a study subject having a particular condition or value.  Again, no individual points are plotted.
  • “Jittering” – adding or subtracting small random values to a precise GIS location so that an individual point is not precisely located on the map.
  • Data point displacement by translation, rotation, or change of scale.
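As promised above, here is a brief sketch of two of these methods, point aggregation into ZIP-code bins and coordinate jittering.  The suppression threshold and jitter distance are illustrative choices, not regulatory requirements, and the input file and columns are hypothetical.

```python
import numpy as np
import pandas as pd

points = pd.read_csv("deidentified_points.csv")   # hypothetical columns: lon, lat, zip

# Point aggregation: report counts per ZIP-code bin instead of individual points,
# and suppress bins too small to protect identities (threshold is illustrative).
by_zip = points.groupby("zip").size().rename("cases").reset_index()
by_zip = by_zip[by_zip["cases"] >= 11]

# Jittering: add random offsets (here up to ~0.01 degrees, very roughly 1 km) so
# that no plotted point falls exactly on a residence.
rng = np.random.default_rng(seed=42)
points["lon_jittered"] = points["lon"] + rng.uniform(-0.01, 0.01, len(points))
points["lat_jittered"] = points["lat"] + rng.uniform(-0.01, 0.01, len(points))
```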

Resolution of the map is also important, as is the population density of the area you are plotting data for.  One needs to be careful, as well, that the de-identification methods do not change the validity of your research results.

So, the use of large GIS data sets is a tremendous opportunity for population health research, but requires specific practices with respect to de-identification when analyzing and publishing that data.  Geocode and aggregate carefully!