The Future of Hadoop in a Cloud-Based World
April 29 2020
Hadoop once presented the promise of economical storage at massive scale, and streamlined processing of petabytes of data. As WANdisco CEO David Richards explains, though Hadoop took a big hit last year, it will stay with us for a while longer.
We’ve seen tectonic shifts in the big data industry this past year, with some $18 billion worth of acquisitions in the data and analytics space, including Salesforce acquiring Tableau, Google acquiring Looker, and CommVault acquiring Hedvig.
This wave of consolidation unquestionably signals a fundamental change in the outlook for Hadoop. Yet even given the recent roller-coaster ride of Cloudera, MapR, and other Hadoop players, it’s too early to eulogize the platform. While Hadoop’s one-time superstar status has certainly diminished, its existence is not in question.
What is Hadoop?
Hadoop is a Java-based open source framework, managed by the Apache Software Foundation, that was designed to store and process massive datasets across clusters of commodity hardware using simple programming models. Built to scale from a single server to thousands of machines, Hadoop relies on software rather than hardware for high availability: the system itself detects and handles failures at the application layer. Hadoop is composed of two primary components: the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
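To make "simple programming models" concrete, here is a minimal sketch of the map, shuffle, and reduce steps behind a word count, the canonical MapReduce example. It runs in a single Python process and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data at scale"]))
```

The developer writes only the map and reduce functions; grouping, distribution, and retries are the framework's job, which is what keeps the model simple.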
HDFS is Hadoop’s main data storage system. It employs a NameNode/DataNode architecture to deliver high-performance access to data in a distributed file system that runs on highly scalable Hadoop clusters. YARN, which was initially named ‘MapReduce 2’ (as the next generation of the wildly popular ‘MapReduce’), schedules jobs and manages resources for all cluster applications. It is also widely used by Hadoop developers to create applications that can work with ultra-large datasets.
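The NameNode/DataNode split, and the idea of handling failures in software, can be illustrated with a toy model. The sketch below is not HDFS code: ToyNameNode and its methods are hypothetical names, and round-robin placement stands in for the rack-aware policy a real cluster uses. It assumes a replication factor of 3, which matches the HDFS default.

```python
class ToyNameNode:
    """Tracks which DataNodes hold each block; stores no file data itself."""

    def __init__(self, datanodes, replication=3):
        self.live = list(datanodes)   # names of DataNodes believed healthy
        self.replication = replication
        self.block_map = {}           # block id -> list of DataNode names

    def add_file(self, name, size, block_size):
        """Split a file into fixed-size blocks and place replicas round-robin."""
        n_blocks = -(-size // block_size)  # ceiling division
        for i in range(n_blocks):
            copies = min(self.replication, len(self.live))
            self.block_map[f"{name}#{i}"] = [
                self.live[(i + r) % len(self.live)] for r in range(copies)
            ]
        return n_blocks

    def handle_failure(self, dead):
        """Drop a failed DataNode and re-replicate its blocks on survivors."""
        self.live.remove(dead)
        for holders in self.block_map.values():
            if dead in holders:
                holders.remove(dead)
                spare = [d for d in self.live if d not in holders]
                if spare and len(holders) < self.replication:
                    holders.append(spare[0])


if __name__ == "__main__":
    nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    nn.add_file("logs", size=300, block_size=128)
    nn.handle_failure("dn2")
    for block_id, holders in nn.block_map.items():
        print(block_id, holders)
```

Because every block lives on several machines, losing one DataNode costs no data. The metadata holder simply schedules new copies, which is the sense in which availability comes from software rather than from specialized hardware.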