Flu is one area of medical science where Big Data has made inroads. Every year, seasonal influenza strains make their way across the world, infecting tens of millions of people and causing serious illness and even death. Every decade or so, a new strain of flu emerges that differs radically from the strains we have previously been exposed to. Because we lack immunologic memory to these pandemic strains, they can cause much more serious illness in large segments of the world population. Tracking the emergence of new influenza strains, monitoring population-level infections, and studying immune responses to both infection and vaccination have therefore generated very large amounts of data, much of which is now publicly available. Basic data are collected each year on the genetic make-up of circulating flu viruses in order to select the strains that will be included in the next year’s influenza vaccine. This involves collecting flu virus specimens from all over the world, genetically sequencing them, clustering the viruses by sequence similarity, and picking the emerging strains that differ enough to need a new vaccine. The process culminates in February, when new vaccine strains are chosen by the World Health Organization and the Centers for Disease Control and Prevention. All of this activity has led to an explosion in the number and types of data sets available for public use.
Data sets for influenza research span the gamut of informatics. At the basic science level, the Influenza Virus Resource contains an extensive database of influenza virus protein sequences. The Influenza Research Database contains sequences and more, including immune epitope data (which sequence segments on the influenza virus are recognized by protective antibodies or cells). These data sets allow scientists to determine how closely related viruses are to each other by sequence comparison. A novel dimensionality reduction method built on multidimensional scaling (MDS), termed “antigenic cartography,” is also available online.
Multidimensional scaling allows complex relationships between influenza virus sequences and vaccine immunity to be reduced to a distance or dissimilarity measure and plotted in two dimensions. This visualization method lets researchers show how closely related different flu strains are in terms of the immune responses they generate. Other groups, such as my laboratory, have performed detailed time-series experiments and collected data on individual immune responses, including expression measurements of 24,000 different genes each day for 11 days after vaccination in multiple subjects. The raw RNAseq gene expression data for this experiment take up approximately 1.7 terabytes.
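To make the multidimensional scaling idea described above a little more concrete, here is a minimal sketch of classical MDS in Python using NumPy. It is only an illustration: the four strain names and the dissimilarity values are made up, the function name `classical_mds` is my own, and published antigenic cartography work fits maps to laboratory data with its own optimization rather than this textbook procedure.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Place points in n_components dimensions from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances: B = -1/2 * J * D^2 * J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # eigh returns eigenvalues in ascending order; keep the largest ones
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding before the square root
    top_vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(top_vals)

# Hypothetical strains and made-up pairwise dissimilarities (not real data)
strains = ["strain_A", "strain_B", "strain_C", "strain_D"]
dissimilarity = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])

coords = classical_mds(dissimilarity)
for name, (x, y) in zip(strains, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Each strain ends up as a point in the plane, and the distance between two points approximates the dissimilarity that went in. The orientation of the map is arbitrary, so only relative positions carry meaning.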
At the other end of the spectrum are near-real-time data on influenza cases, tracked across the United States by the Centers for Disease Control and Prevention, in Europe, and across the world by the World Health Organization. A particularly relevant Big Data app is the Google Flu Trends site, which uses big data aggregation methods to tally Google searches for influenza-related terms by geographic location. Google search activity increases during seasonal influenza outbreaks and parallels CDC data on confirmed cases of influenza or influenza-like illness. It is a great example of the “Four V’s” of Big Data in use: Volume, Velocity, Variety, and Veracity. One of my colleagues, Henry Kautz at the University of Rochester, and his graduate student Adam Sadilek (now at Google) have taken this a step further, estimating your odds of getting influenza or another illness by real-time analysis of GIS-linked Twitter feeds! A demonstration can be found on their web site, GermTracker. They couple microblog GIS information with AI methods that analyze illness-related keyword content to determine the likelihood that you are sick, or that you have come into contact with somebody who is sick. They then predict your odds of coming down with the flu in the next 8 days based on your location. What has been done for flu in terms of public big data can be done for other infectious diseases at multiple levels, and we will likely see an increasing trend toward open-source data, crowdsourced predictive analytics, and real-time big data analysis.
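For readers who want a feel for the search-tallying idea behind sites like Google Flu Trends, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the keyword list, the region codes, the sample queries, and the weekly numbers are invented, and the real system is certainly more sophisticated than a substring match.

```python
from collections import Counter
from math import sqrt

# Hypothetical keyword list; the terms actually used by Google are not given here
FLU_TERMS = ("flu", "influenza", "fever")

def tally_by_region(queries):
    """Count queries that mention a flu-related term, grouped by region."""
    counts = Counter()
    for region, text in queries:
        lowered = text.lower()
        if any(term in lowered for term in FLU_TERMS):
            counts[region] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up (region, query) pairs, not real search logs
queries = [
    ("NY", "flu symptoms"),
    ("NY", "cheap flights"),
    ("NY", "Influenza vaccine near me"),
    ("CA", "fever and chills"),
    ("CA", "weather tomorrow"),
]
region_counts = tally_by_region(queries)

# Made-up weekly totals: flu-related searches vs. reported cases
weekly_searches = [120, 180, 260, 400, 310]
weekly_cases = [30, 45, 70, 100, 80]
r = pearson(weekly_searches, weekly_cases)

print(dict(region_counts))
print(f"correlation between searches and cases: {r:.3f}")
```

A correlation near 1 on real data would be what "parallels data from the CDC" means in practice: weeks with more flu-related searches are also weeks with more reported cases.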