Weekly Roundup for Big Data in Medical Science: April 21-28, 2014

 

Image of the Week

Data graphic created from the Institute for Health Metrics and Evaluation web app showing the number of years people with chronic kidney disease live with their disability after diagnosis.

Fast Upload

This week: IBM Launches Watson-based big data services for clinical care; Persephone, the Real-Time Genome Browser; and yet another online flu web-page view correlation… Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real time

 

Bits and Bytes

Upcoming Events

| Link | When | What | Where |
|---|---|---|---|
| MIWC 2014 | April 28-29, 2014 | Medical Informatics World Conference | Boston, MA |
| BDM 2014 | May 21-23, 2014 | Big Data in Biomedicine Conference | Stanford, CA |
| ASE BDS 2014 | May 27-31, 2014 | Second ASE International Conference on Big Data Science and Computing | Stanford, CA |
| HCI-KDD@AMT 2014 | August 11, 2014 | Special Session on Advanced Methods in Interactive Data Mining for Personalized Medicine | Warsaw, Poland |
| BigR&I 2014 | August 27-29, 2014 | International Symposium on Big Data Research and Innovation | Barcelona, Spain |
| ICHI 2014 | September 15-17, 2014 | IEEE International Conference on Healthcare Informatics | Verona, Italy |

How Much Unstructured Big Medical Data Is There In The EMR?

How much unstructured big data is there in the EMR? Unstructured data is data that doesn’t fit into neat columns on a spreadsheet, or into fields and look-up tables in a database; the narrative text in an HPI is a typical example. We used to sit down with a pen and the paper chart and write our progress notes in the office and in the clinic. Or we dictated the notes, which were then transcribed. But with the advent of the EMR, templates have crept in, as has the widespread and controversial practice of copying and pasting text from a previous encounter (see the recent NYT article).

This is interesting in a quirky way. As physicians, nurse practitioners, and other providers have become reluctant data entry clerks, they rely on many shortcuts to leave time to take care of patients: templates with stylized or constrained vocabularies, self-generated “smart phrases”, and patient-specific narratives that can be recalled and modified. The remainder of the note is populated with structured data already in the system (labs, test results, x-ray results). Because a patient’s condition often does not change dramatically from one day to the next, the genuinely novel unstructured content in a note may be only a tiny fraction of its total bytes, and the change between the current and previous note may carry as much information as the content itself. But when people get hurried or sloppy, information that is no longer current gets carried along unchanged in the notes. So the key information extraction question is how to identify the true changes, separate them from relatively static or outdated data that is carried along, and extract the novel information.
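As a rough illustration of the "identify the true changes" step, here is a minimal Python sketch that compares two consecutive notes line by line using the standard library's `difflib` and keeps only the lines that are new or changed. The function name `novel_lines` and the two sample notes are hypothetical, made up purely for illustration; real notes would need sentence splitting, de-identification, and much more robust matching than this.

```python
import difflib

# Hypothetical sample notes, invented for illustration only.
DAY1_NOTE = """Patient reports mild fatigue.
Creatinine 2.1 mg/dL.
Continue lisinopril 10 mg daily."""

DAY2_NOTE = """Patient reports mild fatigue.
Creatinine 2.4 mg/dL.
Continue lisinopril 10 mg daily.
Referred to nephrology."""


def novel_lines(previous_note: str, current_note: str) -> list[str]:
    """Return the lines of current_note that are new or changed
    relative to previous_note (a line-level diff, nothing smarter)."""
    prev = [ln.strip() for ln in previous_note.splitlines() if ln.strip()]
    curr = [ln.strip() for ln in current_note.splitlines() if ln.strip()]

    # autojunk=False so frequently repeated template lines are not ignored.
    matcher = difflib.SequenceMatcher(a=prev, b=curr, autojunk=False)

    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark spans of the current note that
        # have no exact counterpart in the previous note.
        if tag in ("insert", "replace"):
            added.extend(curr[j1:j2])
    return added


if __name__ == "__main__":
    for line in novel_lines(DAY1_NOTE, DAY2_NOTE):
        print(line)
```

Note what this does not do: a diff only surfaces text that changed. It cannot tell whether the unchanged, carried-along lines are still true, which is the harder half of the problem described above.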

How is this relevant to big data analytics in medicine? If much of the content is captured by a stylized vocabulary and filled with structured data already present in data tables, how much independent information will there be in a medical note? If the data has dependencies because of this stylized nature and controlled vocabularies, how does this impact data mining and statistical analytics? I am not sure if this type of problem has a formal technical term in machine learning, but if not, it is likely to get one soon!
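One crude way to put a number on "how much independent information" a note adds is to ask how many extra compressed bytes it costs once the previous note is already known. The Python sketch below uses `zlib` for this; it is a back-of-the-envelope proxy, not a formal measure, and the helper names `compressed_size` and `extra_bytes` as well as the sample text are assumptions introduced here for illustration.

```python
import zlib

# Hypothetical templated note, invented for illustration only.
TEMPLATE_NOTE = (
    "Subjective: patient reports no new complaints. "
    "Objective: vital signs stable, labs reviewed. "
    "Assessment: chronic kidney disease, stable. "
    "Plan: continue current medications, follow up in three months."
)


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(previous_note: str, current_note: str) -> int:
    """Approximate cost of encoding current_note given previous_note:
    compress both together, then subtract the cost of the previous note."""
    together = compressed_size(previous_note + "\n" + current_note)
    return together - compressed_size(previous_note)


if __name__ == "__main__":
    copied = TEMPLATE_NOTE  # a pure copy-and-paste note
    updated = TEMPLATE_NOTE + " New finding: ankle swelling since Tuesday, started furosemide."
    print("copied note adds :", extra_bytes(TEMPLATE_NOTE, copied), "bytes")
    print("updated note adds:", extra_bytes(TEMPLATE_NOTE, updated), "bytes")
```

A copied-forward note should cost only a few extra bytes, while a note with a genuinely new sentence costs noticeably more. Any analysis that treats every note as an independent observation is ignoring exactly this gap.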