Gordon Moore, the co-founder of Intel, estimated in 1965 that the number of components on a computer chip would keep doubling at a steady pace, a rate he later revised to roughly every 24 months. Healthcare is already set to generate 2,314 exabytes of data by 2020. Given the history of technology and Moore’s Law, one can guess how quickly healthcare, among other sectors in the U.S., will evolve towards Big data.
We’re dealing with enormous volumes of data every day: by one widely cited estimate, 90% of the world’s data has been created in the last two years. Here are some facts about the data-driven world we are living in:
- Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data: images, messages, comments, and even the ‘likes’!
- More than 294 billion emails are sent and received every day, along with 230+ million tweets, and at least 5 billion people are calling, texting, tweeting and browsing on their phones.
- Some experts estimate that each patient will add 4 MB of data to their EMR every year.
Where does data in healthcare come from?
Data in healthcare doesn’t just come from EHRs. In fact, EHR data represents only about 8% of the data required for population health. The rest comes from other domains:
- Genomic data
- Outcomes-based data
- Biometric data
- Patient-generated data
- Social data
Not surprisingly, at least 80% of this data is unstructured, with no predefined format or schema. Healthcare data already has plenty of “volume,” “velocity,” and “variety,” the 3 V’s of Big data, and both its diversity and its amount are only set to increase.
Moreover, one of the major challenges data-driven industries face today is getting different forms of data into a relational database management system (RDBMS). Structured data is already in a relational format, so incorporating it into an RDBMS is fairly easy. Semi-structured data, such as CSV, XML and JSON files, X12 (835/837) files, HL7 feeds, or even MRI images and a doctor’s template-generated notes, is harder to fit into relational tables, and unstructured data like emails, texts, Word files, images, and videos also makes its way into the healthcare data realm.
Big data comes with its own set of challenges
Alongside data quality, analytics, data security, and a lack of talent, here are some significant challenges with Big data in healthcare:
- Storage of data: Healthcare generates a huge amount of data in the form of clinical records, lab results, immunization records, social determinants, and a lot more.
- Heterogeneous data: Healthcare data is unstructured, semi-structured, and structured, making it necessary to have a robust system that can store all these varieties of data generated from various sources.
- Accessing and processing speed: If you have only one I/O channel that moves 100 MB per second and you need to process 1 TB of data, it will take approximately 2 hours 54 minutes. If you instead spread the same data across four machines, each with its own channel, it will take about 43 minutes.
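The timing in the last point can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the channel moves about 100 megabytes per second, that 1 TB is counted as 1,048,576 MB, and that the work divides evenly across channels. The function name `transfer_minutes` is made up for illustration.

```python
def transfer_minutes(data_mb, channel_mb_per_s, channels=1):
    """Minutes needed to read data_mb when it is split evenly across channels."""
    seconds = data_mb / (channel_mb_per_s * channels)
    return seconds / 60


# Assumption: binary units, so 1 TB = 1024 * 1024 MB.
ONE_TB_IN_MB = 1024 * 1024

single = transfer_minutes(ONE_TB_IN_MB, 100)
parallel = transfer_minutes(ONE_TB_IN_MB, 100, channels=4)

print(f"1 channel:  {single:.1f} minutes")    # about 174.8 minutes, i.e. 2 h 54 min
print(f"4 channels: {parallel:.1f} minutes")  # about 43.7 minutes
```

Real clusters rarely scale this cleanly, since coordination and network overhead eat into the gain, but the calculation shows why adding channels matters more than speeding up a single one.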
Bringing Hadoop to healthcare
Hadoop is an open-source, Java-based framework for storing and analyzing data that can handle large volumes of structured and unstructured data more efficiently than a traditional data warehouse.
There are basically two components in Hadoop:
- HDFS (storage): Hadoop Distributed File System, or HDFS, allows users to store any kind of data across a cluster and stream those data sets at a high bandwidth.
- YARN (processing): YARN is Hadoop’s resource management layer. It enables parallel processing of the data stored in HDFS by allocating resources and scheduling tasks.
Another core building block in a Hadoop framework is MapReduce, a programming model for processing and generating large data sets across multiple servers in a Hadoop cluster. It ‘maps’ the input into key-value pairs on nodes across the cluster, groups those pairs by key, and ‘reduces’ each group into a combined result that forms a cohesive answer to a query.
Consider a simple MapReduce example, where MapReduce processes a file ‘animals.txt’ with the following content:
Apple, Orange, Orange, Lime, Apple, Car, Car, Lime, Car
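Here’s how MapReduce will work on this file, taking a word count as the job: the input is divided into splits, the map step emits a (word, 1) pair for every word, the pairs are grouped by word, and the reduce step adds up each group, giving Apple 2, Orange 2, Lime 2 and Car 3.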
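A minimal sketch of that flow as a word count, written in plain Python rather than against Hadoop’s own Java API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the choice of three-word splits are assumptions made for illustration. On a real cluster each split would be handled by a different node.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


text = "Apple, Orange, Orange, Lime, Apple, Car, Car, Lime, Car"
words = [word.strip() for word in text.split(",")]

# Three splits of three words each stand in for blocks held on different nodes.
splits = [words[i:i + 3] for i in range(0, len(words), 3)]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(mapped))

print(counts)  # {'Apple': 2, 'Orange': 2, 'Lime': 2, 'Car': 3}
```

The map and reduce steps never need to see the whole file at once, which is what lets the framework run them on many machines in parallel.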
The million-dollar question is: how will healthcare benefit from Hadoop as a solution?
1.) The first problem with healthcare Big data was storing it. The data is voluminous, and ranges from text files, CSV, HL7 feeds, and X12 files to images, videos, and a doctor’s notes. This problem is tackled by HDFS, which provides a distributed way to store Big data. If you have 1024 MB of data and you store it in HDFS with 128 MB blocks, HDFS will divide the data into 8 blocks (1024/128) and store them across different data nodes.
Also, since HDFS focuses on ‘horizontal scaling’ over ‘vertical scaling,’ you can always add some extra nodes to the existing framework, instead of scaling up the resources.
2.) The second issue was storing the variety of data. Since HDFS does not validate data against a predefined schema when it is written, it can also hold unstructured data sets.
3.) The third challenge was slow accessing and processing speed. It is important that we move “processing to data” and not “data to processing.” With Hadoop YARN, the processing logic is sent to the worker nodes, and the data is then processed across those nodes in parallel.
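The block arithmetic in the first point can be sketched as follows. This is an illustration under stated assumptions, not how HDFS is implemented: the helper names and node names are hypothetical, and the round-robin placement is a simplification that ignores replication and HDFS’s actual placement policy.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size in MB of each block a file of file_size_mb would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def place_blocks(block_sizes, nodes):
    """Assign block indexes to nodes round-robin (a simplification, see lead-in)."""
    placement = {node: [] for node in nodes}
    for index in range(len(block_sizes)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


blocks = split_into_blocks(1024)
layout = place_blocks(blocks, ["node-1", "node-2", "node-3", "node-4"])

print(len(blocks))  # 8
print(layout)       # each of the four nodes ends up holding two blocks
```

Adding a fifth entry to the node list spreads the same blocks more thinly, which is the horizontal scaling the text describes.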
What do CTOs, CIOs and IT leaders need to consider before implementing Hadoop?
One of the major challenges with Hadoop is the range of programming languages it relies on. A simple RDBMS can leverage SQL, whereas working on Hadoop requires knowledge of Scala, Java, or Python. Owing to this, the IT industry has invested heavily in bringing SQL to Hadoop, and today there are several widely known options for SQL on Hadoop, including:
- Spark SQL
- Apache Drill
Secondly, finding people skilled in Hadoop is a challenge. If you plan on implementing Hadoop in your organization, you need to make sure you have a team of skilled people to deploy it, manage it, and query data from it. Also, because Hadoop is open source, a deployment means assembling different layers around an enormous amount of data, so security and segmentation naturally become issues.
What does the future of Hadoop in healthcare look like?
We all know that the demands on data in healthcare are growing, and I think it’s safe to say that Big data is coming to healthcare. Our current data strategies, which rely on limited storage and traditional data warehouses, won’t be able to keep up with the boom in data. A Hadoop-based data repository, much like Datashop, has the agility, efficiency and scalability required to prepare for Big data and leverage its insights to bring about value-based outcomes.