In part 1 of this post, I described some of the ways big data tools can be used in health, and pointed out an irony: in the U.S., quality and efficiency uses can frequently fall under the Health Insurance Portability and Accountability Act (HIPAA) “treatment, payment, and operations,” while patient-identifiable data used for research, by virtue of being “designed to develop or contribute to generalizable knowledge,” must meet much more stringent constraints.
Some big data analytics and observational research can also be done on HIPAA de-identified data. But the traditional issues with de-identified data will be particular obstacles for other big data outcomes. Big data tools and data sets, for example, will increasingly bring re-identification of HIPAA de-identified data to the fore. When larger and broader publicly available data sets are joined with newly de-identified data, existing de-identification approaches become even less durable and identities become easier to re-establish.
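To make the re-identification concern concrete, here is a minimal sketch of a linkage on shared coarse fields. Everything in it is an assumption for illustration: the records are invented, and the field names (`zip3`, `birth_year`, `sex`) simply stand in for the kind of coarse attributes that might remain in a de-identified set and also appear in a public one. It is not a description of any real data set or attack.

```python
from collections import defaultdict

# Hypothetical toy data: every name, ZIP prefix, and year below is invented.
deidentified_visits = [
    {"zip3": "021", "birth_year": 1954, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"zip3": "021", "birth_year": 1987, "sex": "M", "diagnosis": "asthma"},
    {"zip3": "946", "birth_year": 1954, "sex": "F", "diagnosis": "hypertension"},
]

public_records = [
    {"name": "Alice Example", "zip3": "021", "birth_year": 1954, "sex": "F"},
    {"name": "Bob Example", "zip3": "021", "birth_year": 1987, "sex": "M"},
    {"name": "Carol Example", "zip3": "021", "birth_year": 1987, "sex": "M"},
    {"name": "Dana Example", "zip3": "946", "birth_year": 1960, "sex": "F"},
]

# Assumed set of fields shared by both data sets.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def key(record):
    """Build the join key from the shared coarse fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(deid_rows, public_rows):
    """For each de-identified row, list the named people who share its key."""
    index = defaultdict(list)
    for row in public_rows:
        index[key(row)].append(row["name"])

    results = []
    for row in deid_rows:
        candidates = index.get(key(row), [])
        results.append({
            "diagnosis": row["diagnosis"],
            "candidates": candidates,
            # Exactly one candidate means the row points at a single person.
            "unique_match": len(candidates) == 1,
        })
    return results


if __name__ == "__main__":
    for result in link(deidentified_visits, public_records):
        print(result)
```

In this toy case one visit matches exactly one named person, one matches two people, and one matches nobody. The more public data sets that can be joined in, the more fields the key can include, and the more often a match narrows to a single person.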
De-identified data are also challenging for the kind of deep analytics that are needed to try to differentiate causality from correlation in observational data. It is indeed the patient’s identity that binds together data for patient-centric research and allows continuous aggregation and linking of data over time. Of course, fully de-identified data also do not support communicating with the patient when new findings or therapies are identified.
Analyzing clinical and claims data for quality and efficiency in accountable care is certainly a big driver for considering big data in healthcare right now. In this area, as in others, the flexibility that big data tools have to work with unstructured, as well as structured, data offers help in pursuing this very complex task.
It is important, then, to also consider improving the sharing and management of less well-structured data. Continuity of care infrastructure requires data from one EHR to be consumed and processed by another EHR via highly structured data messages. Big data approaches benefit from, but are not wholly constrained by, such highly structured data. Standards and technologies for indexing, marking up, matching, and linking data are still important, though. For the time being, these efforts will rely on ad hoc data warehouse accumulation techniques unless other standardization of infrastructure is advanced.
There is an analogous tension on display in the area of healthcare quality, where approaches for advancing the measurement of national quality outcomes do not take advantage of the massive amount of less well-structured electronic data that are already available in the healthcare infrastructure.
In public health, as in other uses, having deeper clinical data that are linkable to the patient can be helpful for supporting the investigation of outbreaks, notification of exposures, and suppression of the spread of infectious diseases. Some of these data do not get passed out of clinical care environments. Current efforts to more broadly tap locally held data for investigation and other public health purposes have focused on either remote EHR access or highly structured distributed query. Full leverage of local big data tools, however, could represent a different path to local query that can accommodate less well-structured data and more generalized use.
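As a rough illustration of what a local query over less well-structured data might look like, the sketch below searches free-text notes held at one site and returns only an aggregate count. The notes, the `local_count` function, and the shape of the result are all hypothetical, and the keyword matching is deliberately naive (it does not handle negation such as "no fever"). It is a sketch under those assumptions, not a description of any existing public health query system.

```python
import re

# Hypothetical free-text notes held inside one clinical site; content is invented.
LOCAL_NOTES = [
    {"patient_id": "p1", "note": "Pt reports fever and dry cough x3 days."},
    {"patient_id": "p2", "note": "Follow-up for knee pain, no fever."},
    {"patient_id": "p1", "note": "Cough persists; chest x-ray ordered."},
    {"patient_id": "p3", "note": "Fever, cough, recent travel."},
]


def local_count(notes, terms):
    """Count distinct patients whose combined notes mention every term.

    Only the count is returned, so the notes and patient identifiers
    stay inside the site that holds them.
    """
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    ]

    # Group note text by patient so terms may appear in different notes.
    notes_by_patient = {}
    for row in notes:
        notes_by_patient.setdefault(row["patient_id"], []).append(row["note"])

    matched = 0
    for texts in notes_by_patient.values():
        combined = " ".join(texts)
        # Naive keyword match: negated mentions ("no fever") still count.
        if all(pattern.search(combined) for pattern in patterns):
            matched += 1
    return {"patients_matching": matched}


if __name__ == "__main__":
    print(local_count(LOCAL_NOTES, ["fever", "cough"]))
```

Grouping notes by patient before matching is what depends on the data being linkable to the patient: the two terms can come from different visits and still count toward the same person.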
The wide range of policy and utilization challenges for big data is at least encouraging for the many health uses they suggest. As big data work progresses, consideration of the known utilization obstacles for analytics should become an increasing focus. Different policy considerations will need to be advanced to enable big data outcomes that match the full big data hype.