Three Considerations for Hadoop-to-Cloud Migration

By Tony Velcich, Aug 03, 2021

Enterprises consider their data and analytics platforms strategic assets that are crucial to digital transformation and business continuity. Yet even as these systems increasingly form the foundation of enterprise business models, some remain a massive challenge for the organizations that run them. On-premises Hadoop deployments are a prime example — they are complex, unscalable, and an increasing burden on IT departments.

That’s why more and more enterprises are migrating away from Hadoop and towards modern cloud-based platforms.

Why migrate?

There are numerous forces driving enterprises to migrate away from Hadoop. Often, it’s a combination of Hadoop’s inherent limitations and demands from the field for advanced analytics services that Hadoop can’t effectively provide. More specifically, enterprise teams are looking to leave Hadoop due to:

Project roadblocks

Enterprises are discovering that Hadoop can’t keep up with their business goals. If only samples of big data can be processed rather than entire petabyte-scale datasets, or if computations take weeks or months to complete rather than days, then the viability of Hadoop deployments is clearly in question.

Unreliable and unscalable

When clusters can’t scale up to meet computing requirements or scale down to cut costs, enterprises relying on Hadoop are frequently left in data, productivity, and budgetary limbo. And the problem isn’t just with the usage and output of these systems — maintaining, patching, and upgrading Hadoop is an operational and human resources burden, too.

Questionable long-term viability

We’ve discussed in previous articles the (rather dire) long-term outlook for on-premises Hadoop. And we’re not the only ones who think so. Even enterprises still strategically committed to Hadoop question the platform’s technological viability and the business stability of its vendors. This is leading enterprises to view Hadoop not only as an impediment, but also as a liability.

Three top Hadoop-to-cloud migration considerations

Once the decision to move away from Hadoop has been made, here are three questions to take into consideration before implementation:

1. What’s the scale of the data migration?

As a rule, the larger the scale, the more complex the migration. And while numerous options exist for small data volumes, few of these work well at scale. Migrating large volumes of data takes time. So, if you’re migrating data over a network, make sure to calculate the time it will take based on your network’s bandwidth while taking into consideration the schedule and size of other workloads.
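As a concrete sanity check, here is a back-of-the-envelope version of that transfer-time calculation. It’s a minimal sketch in Python; the dataset size, link speed, and utilization factor are illustrative assumptions, not recommendations.

```python
# Back-of-the-envelope estimate of network transfer time.
# All inputs are illustrative assumptions; plug in your own figures.

def transfer_days(dataset_tb: float, link_gbps: float, utilization: float = 0.5) -> float:
    """Days needed to move dataset_tb terabytes over a link_gbps link,
    assuming only a `utilization` fraction of the bandwidth is available
    (the rest is consumed by other scheduled workloads)."""
    dataset_bits = dataset_tb * 1e12 * 8           # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization  # usable bits per second
    return dataset_bits / effective_bps / 86_400   # seconds -> days

# Example: 1 PB (1,000 TB) over a 10 Gbps link with half the bandwidth usable.
print(f"~{transfer_days(1_000, 10, 0.5):.0f} days")  # ~19 days
```

Even under these generous assumptions, a single petabyte occupies the link for the better part of three weeks, which is exactly why the next question, about ongoing changes, matters so much.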

2. How much data change occurs in your Hadoop environment?

Business disruption is a top concern for planned Hadoop migration projects, and handling on-premises data changes during migration is a key challenge noted by enterprises that have already migrated Hadoop data to the cloud. It is challenging because typical Hadoop production environments are very active, with high levels of data ingest and updates. Measurements at one customer’s implementation showed peak loads on their on-premises Hadoop deployment reaching upwards of 100,000 file system events per second, with loads over a 24-hour period averaging 20,000 file system events per second. This ongoing activity adds to migration time and complexity (a rough sizing of the change backlog follows the list below), leaving enterprises with three options for managing changes during migration:

  1. Don’t allow changes to happen (leads to system downtime and business disruption)

  2. Develop a custom solution to manage changes

  3. Leverage tools (like WANdisco) that are purpose-built to handle changes
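To see why the choice matters, it helps to size the backlog of changes that would accumulate over a migration window. The sketch below reuses the 20,000 events-per-second average measured above; the 19-day window is an assumption carried over from the earlier transfer-time estimate.

```python
# Rough count of file system changes accumulating during a migration.
# The event rate comes from the measurement cited above; the window
# length is an assumption from the earlier bandwidth sketch.

avg_events_per_sec = 20_000   # 24-hour average from the measured deployment
migration_days = 19           # assumed network-transfer window

backlog = avg_events_per_sec * 86_400 * migration_days
print(f"{backlog:,} events to reconcile")  # 32,832,000,000 events
```

Tens of billions of events is far beyond what manual reconciliation can absorb, which is why option 1 forces downtime and option 2 quickly becomes a substantial engineering project in its own right.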

3. Will your migration approach require manual or custom development efforts?

There are a number of Hadoop-to-cloud data migration methodologies and approaches, each with its own trade-offs. For example, data transfer devices like the Azure Data Box can get petabyte-scale datasets from point A to point B. Yet these solutions may require system downtime or some method for handling data changes that occur during the transfer. Similarly, network-based data transfer with manual reconciliation of data changes may work for small volumes, but isn’t viable at scale.

Hadoop comes packaged with DistCp, a free tool that is frequently used to start data migration projects…but less often to finish them. The problem is that DistCp was designed for inter- and intra-cluster copying of data at a specific point in time — not for ongoing changes. DistCp requires multiple passes, plus custom code or scripts to accommodate changing data, making it impractical for an enterprise-class migration.
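That scripting burden looks roughly like the sketch below. The source and target paths and the delta-checking helper are hypothetical; only the `hadoop distcp -update -delete` invocation itself is standard DistCp usage.

```python
# Sketch of the custom scripting DistCp pushes you toward: repeated
# catch-up passes until the remaining delta is small enough to freeze
# writes and cut over.
import subprocess

SRC = "hdfs://namenode:8020/data"                           # hypothetical source
DST = "abfs://container@account.dfs.core.windows.net/data"  # hypothetical target

def distcp_pass() -> None:
    """One catch-up pass: -update copies only files that differ,
    -delete removes target files no longer present at the source."""
    subprocess.run(
        ["hadoop", "distcp", "-update", "-delete", SRC, DST],
        check=True,
    )

def remaining_delta() -> int:
    """Hypothetical helper: DistCp won't report what changed while it
    ran, so you must diff file listings or checksums yourself."""
    return 0  # placeholder; a real implementation compares snapshots

distcp_pass()                      # initial bulk copy
while remaining_delta() > 1_000:   # arbitrary cutover threshold
    distcp_pass()                  # catch-up passes chase the changes
```

Each pass re-scans the namespace, and changes made mid-pass are only picked up on the next one, so on a busy cluster the loop can chase its tail indefinitely.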

Finally, there are next-gen automated migration tools (like WANdisco LiveData Migrator) that allow migrations to occur while production data continues to change — with no system downtime or business disruption. These solutions enable IT resources to focus on strategic development efforts, not on migration code.

The bottom line

As enterprises migrate away from Hadoop in favor of cloud-based platforms, they are looking more closely not just at the end results of migration, but at the process itself. Large-scale data migration is a massive undertaking — there’s no question. Yet by choosing the right tools for the job — tools that enable business data to flow freely and core business functions to continue unhindered, even during petabyte-scale migration — the viability of this strategic shift increases dramatically.


Tony Velcich

Tony is an accomplished product management and marketing leader with over 25 years of experience in the software industry. He is currently responsible for product marketing at WANdisco, helping to drive go-to-market strategy, content, and activities. Tony has a strong background in data management, having worked at leading database companies including Oracle, Informix, and TimesTen, where he led strategy for areas such as big data analytics for the telecommunications industry, sales force automation, and sales and customer experience analytics.
