Weekly Roundup for Big Data in Medical Science: April 21-28, 2014

 

Image of the Week

Data graphic created from the Institute for Health Metrics and Evaluation web app showing the number of years people with chronic kidney disease live with their disability after diagnosis.


Fast Upload

This week: IBM launches Watson-based big data services for clinical care; Persephone, the Real-Time Genome Browser; and yet another online flu web-page view correlation… Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real time.

 

Bits and Bytes

Upcoming Events

Link | When | What | Where
MIWC 2014 | April 28-29, 2014 | Medical Informatics World Conference | Boston, MA
BDM 2014 | May 21-23, 2014 | Big Data in Biomedicine Conference | Stanford, CA
ASE BDS 2014 | May 27-31, 2014 | Second ASE International Conference on Big Data Science and Computing | Stanford, CA
HCI-KDD@AMT 2014 | August 11, 2014 | Special Session on Advanced Methods in Interactive Data Mining for Personalized Medicine | Warsaw, Poland
BigR&I 2014 | August 27-29, 2014 | International Symposium on Big Data Research and Innovation | Barcelona, Spain
ICHI 2014 | September 15-17, 2014 | IEEE International Conference on Healthcare Informatics | Verona, Italy

Personal Biosensors and the Internet of Medical Things


There is a tsunami of new devices and apps that will help you record everything from the number of steps you took in a day to calories and caffeine ingested, sleep quality, weight, blood pressure and blood glucose levels.  The next revolution in medicine will be the Internet of Medical Things (IoMT): uniquely tagged devices that help monitor blood pressure, blood glucose, physical activity, temperature, sleep, and even motion.  Along with patient-entered data from tablets, mobile devices, and conventional desktop computers, data from these devices will change the face of medicine, increase our ability to engage patients in their own health behaviors, and provide massive amounts of data for population health studies on an unprecedented scale.

Personal biosensor devices (PBDs) like Fitbit and Jawbone have become all the rage, and many corporations are looking to provide PBDs to employees with the goal of improving employee health.  Often the devices are paired with financial incentives to motivate people to change behavior.  As reported last year in Wired, the company +Citizen has a program in which employees have voluntarily agreed to share their fitness, productivity and happiness data.  Many vendors, such as FitLinxx, SparkPeople, and Endomondo, specifically offer employer packages.

Mobile apps are branching out and rapidly linking with these devices, allowing geospatial and biometric data to be coupled.  The volume of data generated by these devices, already in use by hundreds of thousands, if not millions, of people, will be staggering.  This past year, clinical research and clinical trials started to incorporate data from smartphones and PBDs.

At present, it is unclear whether apps or PBDs will alter health behavior.  Despite their ubiquity, there is little data on whether diabetics who use such mobile software to manage their blood sugars achieve better glucose control.  Do weight loss and calorie-counting apps really achieve their goals?  I think that it’s fair to say that anecdotal evidence suggests great promise in many cases.  From a personal standpoint, my Fitbit has made me more aware of my sedentary computer habits, and motivated me to take more steps and get out to run more.  My favorite recent awareness-raising app, pointed out to me by my colleague Joshua Schwimmer, is UpCoffee by Jawbone.  I had no idea of the half-life of caffeine before I downloaded the app!

The impact of PBDs and apps may not be all good, or all predictable.  Sometimes, personal bio-sensing apps can actually lead to bad outcomes.  An article by Alice Gregory in the New Republic last year describes how calorie-counting mobile fitness apps can worsen eating disorders.  Given the studies that have described the addictive properties of electronic devices and the internet, and the underlying biology, it is not surprising that these problems can be exacerbated in people with addictive or compulsive tendencies or illnesses.

Where all this leads, we don’t know yet.  Certainly it leads to very large data sets and to something far beyond telemedicine.  Something exciting is happening in medicine and research.  I hope that this will lead to the ability to crowdsource population health research questions and studies beyond our wildest imagination.  What would you study if you had access to data from a million PBDs?

8 March 2014: Weekly Roundup for Big Data in Medicine and Science

 

Image of the Week


Young S, Rivers C, Lewis B (2014) Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes.  Preventive Medicine.  http://dx.doi.org/10.1016/j.ypmed.2014.01.024

 

Fast Upload

This week: big data breaches at LA County medical facilities, and more US healthcare delivery companies explore the use of data mining and analytics.  From the Healthcare Information and Management Systems Society meeting this week: “… all healthcare data is big data, and it’s only going to be getting bigger”.

 

Bits and Bytes

 

Upcoming Events

Link | When | What | Where
BDM 2014 | May 21-23, 2014 | Big Data in Biomedicine Conference | Stanford, CA
MIWC 2014 | April 28-29, 2014 | Medical Informatics World Conference | Boston, MA
ASE BDS 2014 | May 27-31, 2014 | Second ASE International Conference on Big Data Science and Computing | Stanford, CA
HCI-KDD@AMT 2014 | August 11, 2014 | Special Session on Advanced Methods in Interactive Data Mining for Personalized Medicine | Warsaw, Poland
BigR&I 2014 | August 27-29, 2014 | International Symposium on Big Data Research and Innovation | Barcelona, Spain
ICHI 2014 | September 15-17, 2014 | IEEE International Conference on Healthcare Informatics | Verona, Italy

Norovirus, Networks, and Big-Data

Another norovirus outbreak has been in the news, this time a group of cases on a cruise ship.  With over 700 passengers and crew falling ill, it is one of the largest outbreaks on a cruise ship ever reported.  Norovirus is a highly contagious member of the Caliciviridae family, and contains multiple genotypes and subtypes.  Small mutations in the norovirus genome lead to new strains, similar to the phenomenon of antigenic drift in influenza viruses.  Larger mutations can lead to pandemic strains when the prevailing population immunity to older strains is no longer effective against the new strain.  The United States is in the midst of the norovirus season, with a new strain being responsible for most cases.

How is Big Data Science revolutionizing the tracking and prediction of norovirus outbreaks?  The US Centers for Disease Control and Prevention now tracks norovirus outbreaks through traditional outbreak surveillance, as reported by public health departments around the US and confirmed by molecular testing of specimens from symptomatic individuals.  But an alternative Big Data approach, real-time social media monitoring, is being tested in the UK by the Food Standards Agency (FSA).  Tweet the hashtag #Barf in London, and your tweet will be added to the FSA statistics, along with the geographic location.  About 50% of gastrointestinal illnesses in the US and UK are caused by norovirus, so tweets and Google searches about stomach cramps, vomiting and diarrhea have a high likelihood of being norovirus related!  FSA researchers found that an upswing in hashtags describing GI symptoms occurred 3-4 weeks before an outbreak was identified by traditional laboratory surveillance.
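The hashtag-monitoring idea described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the FSA's actual method: the watch list of hashtags, the (date, text, area) tweet format, and the "twice the recent average" rule for calling an upswing are all hypothetical choices made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical watch list; the real monitored hashtags are not given in the post.
SYMPTOM_TAGS = {"#barf", "#vomit", "#stomachbug"}

def weekly_symptom_counts(tweets):
    """Tally symptom-tagged tweets per (ISO year, ISO week, area).

    tweets: iterable of (date, text, area) tuples (assumed format).
    """
    counts = Counter()
    for day, text, area in tweets:
        # Lowercase each word and strip trailing punctuation before matching.
        words = {w.lower().rstrip(".,!?") for w in text.split()}
        if words & SYMPTOM_TAGS:
            iso = day.isocalendar()
            counts[(iso[0], iso[1], area)] += 1
    return counts

def upswing_weeks(series, window=3, ratio=2.0):
    """Return indexes of weeks whose count is at least `ratio` times the
    mean of the previous `window` weeks (an assumed, simplistic rule)."""
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if baseline > 0 and series[i] >= ratio * baseline:
            flagged.append(i)
    return flagged
```

A real system would also need to handle retweets, jokes, and tweets without location data; the sketch only shows the counting and thresholding steps.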

#Vomit:  Predicting Norovirus Outbreaks with Twitter

So how can Big Data Science contribute to solutions?  Recognizing outbreaks in real time using Big Data analytics is a start.  Taming data velocity and volume is key here.  Early recognition can lead to containment and to public health strategies that limit the outbreak.  But potential solutions go beyond larger public health responses.  One of the major ways individuals can prevent the spread of the virus, and protect themselves from being infected, is simple good hygiene such as hand washing.  Norovirus outbreaks occur more frequently in places where people live together and have risk factors such as being elderly, immunosuppressed, or very young.  Day care centers, nursing homes and hospitals are the key areas.  In a novel application of Big Data Science real-time analytics, IBM has developed a method of tracking hand washing among healthcare workers after each patient contact.  An RFID tag carried by the worker, coupled with sensors that record entry into the room, exit, and use of a hand sanitizer dispenser, has led to pronounced increases in hand washing.  The data are not yet in on whether this will reduce infectious outbreaks or their spread, but if the promise bears out, look for such systems in high-risk areas such as institutional kitchens, day care centers and other areas.  It does seem a bit Big Brother-ish, which is a topic for my next post…
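To make the RFID idea concrete, here is a minimal Python sketch of how entry, exit, and dispenser events might be turned into a per-worker compliance tally.  It is not IBM's system: the event format, the event names, and the 60-second grace period after leaving the room are all assumptions made for illustration.

```python
from collections import defaultdict

def compliance_by_worker(events, grace=60):
    """Count compliant room visits per worker.

    events: list of (timestamp_seconds, worker_id, kind) tuples, where kind is
            "enter", "exit", or "sanitize" (assumed format).
    A visit (enter followed by exit) counts as compliant if the same worker
    used a dispenser between entry and `grace` seconds after exit.
    Returns {worker_id: (compliant_visits, total_visits)}.
    """
    per_worker = defaultdict(list)
    for ts, worker, kind in sorted(events):
        per_worker[worker].append((ts, kind))

    result = {}
    for worker, log in per_worker.items():
        sanitize_times = [ts for ts, kind in log if kind == "sanitize"]
        compliant = total = 0
        entered = None
        for ts, kind in log:
            if kind == "enter":
                entered = ts
            elif kind == "exit" and entered is not None:
                total += 1
                if any(entered <= s <= ts + grace for s in sanitize_times):
                    compliant += 1
                entered = None
        result[worker] = (compliant, total)
    return result
```

Streaming these tallies back to staff in near real time is where the velocity part of the problem comes in; the sketch only covers the batch calculation.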

For now… wash your hands, tweet your symptoms, and stay healthy!

Big Data and the Flu

Flu is one area in medical science where Big Data has made inroads.  Every year, the seasonal influenza strains make their way across the world, infecting tens of millions of people, causing serious illness and even death.  Every decade or so, a new strain of flu emerges which differs radically from the prior strains that we have been exposed to.  Because we lack immunologic memory to these pandemic strains, they are able to cause much more serious illness in large segments of the world population.  Thus, tracking the emergence of new influenza viral strains and population-level infections, and studying immune responses to both infection and vaccination, has generated very large amounts of data, much of which is now publicly available.

Basic data is collected each year on the genetic make-up of circulating flu viruses to select the strains which will be included in the next year’s influenza vaccine.  This involves collecting flu virus specimens from all over the world, genetically sequencing them, clustering viruses by sequence similarity, and picking the emerging strains that differ enough to need a new vaccine.  The process culminates in February, when new vaccine strains are chosen by the World Health Organization and the Centers for Disease Control and Prevention.

All of this activity has led to an explosion in the number of data sets and types available for public use.  Data sets for influenza research span the gamut of informatics.  At the basic science level, the Influenza Virus Resource contains an extensive database of influenza virus protein sequences.  The Influenza Research Database contains sequences and more, including immune epitope data (which sequence segments protective antibodies or cells recognize on the influenza virus).  These data sets allow scientists to determine how related viruses are to each other by sequence comparison.

A novel dimension-reduction approach termed “antigenic cartography” makes use of multidimensional scaling (MDS) methods.  Multidimensional scaling allows complex relationships between influenza virus sequences and vaccine immunity to be reduced to a distance or dissimilarity measure and plotted in two dimensions.  This visualization method allows researchers to show how related different strains of flu are in the immune response they generate.  Other groups, such as my laboratory, have performed detailed time-series experiments and collected data on individual immune responses, including measurements of 24,000 different genes each day for 11 days after vaccination in multiple subjects.  The raw RNAseq gene expression data for this experiment takes up approximately 1.7 terabytes.
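The sequence-comparison and MDS steps described above can be sketched in plain Python.  This is a toy illustration, not the antigenic cartography software: it assumes sequences are already aligned to equal length, uses a simple fraction-of-mismatches distance, and implements textbook classical MDS with power iteration, whereas real antigenic maps are built from immune assay data with more robust numerical methods.

```python
import math

def hamming(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs):
    """Pairwise dissimilarity matrix for a list of aligned sequences."""
    return [[hamming(a, b) for b in seqs] for a in seqs]

def _top_eigenpair(m, iters=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(m)
    v = [1.0 + i for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

def classical_mds(d, dims=2):
    """Embed n items in `dims` dimensions from an n x n distance matrix."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double-centre the squared distances to obtain an inner-product matrix.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        lam, v = _top_eigenpair(b)
        if lam <= 1e-12:  # no remaining positive structure to embed
            break
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords
```

Calling `classical_mds(distance_matrix(seqs))` on a handful of aligned strings returns one (x, y) pair per sequence; strains that differ at fewer positions land closer together on the resulting map.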

Google Flu Trends predictions of influenza activity by web search term aggregation

At the other end of the spectrum are near-real time data on influenza cases, tracked across the United States by the Centers for Disease Control and Prevention, and in Europe and across the world by the World Health Organization.  A particularly relevant Big Data app is the Google Flu Trends site, which uses big data aggregation methods to tally Google searches for influenza-related terms by geographic location.  Google search activity increases during the seasonal influenza outbreaks, and parallels data from the CDC on confirmed cases of influenza or influenza-like illnesses.  It is a great example of the “Four V’s” of Big Data in use:  Volume, Velocity, Variety and Veracity.

One of my colleagues, Henry Kautz at the University of Rochester, and his graduate student Adam Sadilek (now @Google) have taken this a step further, estimating your odds of getting influenza or other illness by real-time analysis of GIS-linked Twitter feeds!  A demonstration can be found on their web site GermTracker.  They use microblog GIS information coupled with AI methods to analyze keyword content linked with illness to determine the likelihood that you are sick, or that you have come into contact with somebody who is sick.  They then predict your odds of coming down with the flu in the next 8 days based on your location.

What has been done for flu in terms of public big data can be done for other infectious diseases at multiple levels, and we will likely see an increasing trend towards open source data, crowdsourced predictive analytics, and real-time big data analysis.
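As a rough illustration of the search-aggregation idea, the Python sketch below tallies flu-related queries by week and region and then measures how closely a weekly tally tracks a surveillance series using a Pearson correlation.  It is not Google's actual model: the term list, the (week, region, text) query format, and the naive substring matching are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical term list; naive substring matching will also catch
# unrelated words such as "influence", which a real system must filter.
FLU_TERMS = ("flu", "influenza", "fever and cough")

def tally_by_region(queries):
    """Count flu-related searches per (week, region).

    queries: iterable of (week, region, text) tuples (assumed format).
    """
    counts = Counter()
    for week, region, text in queries:
        lowered = text.lower()
        if any(term in lowered for term in FLU_TERMS):
            counts[(week, region)] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)
```

Feeding `pearson` a region's weekly search tallies alongside the matching weekly case counts gives a single number for how well the two series move together; a value near 1 is what "parallels the CDC data" would look like.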

How Much Unstructured Big Medical Data Is There In The EMR?

How much unstructured big data is there in the EMR? Unstructured data is data that doesn’t fit into neat columns on a spreadsheet, or fields and look-up tables in a database, like the narrative text in an HPI. It used to be that we sat down with a pen and the paper chart, and wrote our progress notes in the office and in the clinic. Or, we dictated the notes, which were transcribed. But with the advent of the EMR, templates have crept in, as well as the widespread and controversial practice of copying and pasting text from a previous encounter (see the recent NYT article).

This is interesting in a quirky way. As physicians, nurse practitioners, and other providers have become reluctant data entry clerks, they use many shortcuts so that they will have time to take care of the patients, including templates with stylized or constrained vocabularies, self-generated “smart phrases”, and patient-specific narratives that can be recalled and modified.  The remainder of the note is populated with structured data already in the system (labs, test results, x-ray results).  Because medical changes are often not so dramatic from one day to the next, the actual novel unstructured information content from one note to the next may only be a tiny fraction of the total bytes, and the change between the current and previous note probably carries as much information as the actual content.  But, when people get hurried or sloppy, old information that is no longer current gets carried along unchanged in the notes.  So, the key information extraction question is identifying the true changes, separating them from relatively static or outdated data that is carried along, and extracting the novel information.
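One simple way to picture the "what actually changed" question is a line-by-line diff between consecutive notes.  The Python sketch below uses the standard library's difflib for this; it assumes notes are plain text with one statement per line, which real EMR notes rarely are, and it cannot tell whether an unchanged line is still true or merely carried along.

```python
import difflib

def novel_lines(previous_note, current_note):
    """Return lines of the current note that are new or changed
    relative to the previous note."""
    prev = previous_note.splitlines()
    curr = current_note.splitlines()
    matcher = difflib.SequenceMatcher(a=prev, b=curr, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(curr[j1:j2])
    return added

def novelty_fraction(previous_note, current_note):
    """Share of the current note's characters that sit on new or changed lines."""
    total = sum(len(line) for line in current_note.splitlines())
    if total == 0:
        return 0.0
    new = sum(len(line) for line in novel_lines(previous_note, current_note))
    return new / total
```

A note that is mostly copied forward would score close to 0 on `novelty_fraction`, while a freshly written note would score close to 1.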

How is this relevant to big data analytics in medicine?  If much of the content is captured by a stylized vocabulary, and filled with structured data already present in data tables, how much independent information will there be in a medical note?  If the data has dependencies because of this stylized nature and controlled vocabularies, how does this impact data mining and statistical analytics?  I am not sure if this type of problem has a formal technical term in machine learning, but if not it is likely to get one soon!
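One crude, hedged way to put a number on "independent information" is to ask how many extra compressed bytes a note costs once the previous note is already known.  The Python sketch below uses zlib for this; compression size is only a rough stand-in for information content, and the approach is an illustration of the question rather than an established method for clinical text.

```python
import zlib

def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def incremental_bytes(previous_note, current_note):
    """Extra compressed bytes needed for the current note when the
    previous note is compressed along with it."""
    both = previous_note + "\n" + current_note
    return compressed_size(both) - compressed_size(previous_note)
```

Comparing `incremental_bytes(prev, curr)` with `compressed_size(curr)` gives a feel for how much of a note is redundant with its predecessor: a copy-and-paste note adds very few bytes, while a note written from scratch adds nearly its full compressed size.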