Blog

26 SEPTEMBER 2017 David Richards, WANdisco

Fulfilling its real-time Big Data analytics potential: Azure HDInsight Service now supports live multi-location data synchronization

Finally, a breakthrough that promises to transform the way companies use advanced Big Data analytics: Microsoft has announced that Azure HDInsight Service users can now replicate data accurately and in real time across two or more locations, via a single-click installation of WANdisco Fusion® on an Azure HDInsight cluster.

This means that it will now be possible to synchronize live data sets between two or more locations in real time – i.e. between day-to-day business systems that are being updated continuously, and the externally-hosted systems and services that are simultaneously crunching that data for other more elaborate and strategically important purposes. In the case of the Microsoft Azure HDInsight Service – Microsoft’s cloud-based solution for Big Data analytics – that secondary use could be social media tracking, IoT/health monitoring, or fraud analytics, for instance. That is, data-intensive applications that are processing and responding to live information feeds on the fly.

The risk with trying to work with live data distributed across more than one geographic location is that unless it is being continuously replicated, there will always be disparity between the different end points. It’s a bit like complex documents requiring input from multiple parties. Without systematic version management or controlled document sharing, there’s always a risk that someone may be working with an older copy of the content – causing chaos.


Shoring up reserves

Without an authoritative, agreed single version of the data ‘truth’, there will be implications not only for the currency of analytics output and the actions this triggers, but also for other scenarios which depend on absolute data synchronicity. 

An obvious one is disaster recovery/business continuity. This is a common first use case for the Cloud: the economics of using a pre-existing, pre-vetted third party to host a copy of important data are very appealing to businesses, compared with setting up their own secondary data center.

But something they may not be aware of is that, where live systems and real-time data are involved, business continuity can only be assured if those secondary data sets are as complete and up-to-date as the data residing in core, internal systems. If the data involved is on a substantial scale (so that backups rely on data being copied across overnight, or via physical transit between locations using hard disks), the lag between updates poses a practical problem.  

If a major transactional system goes down and the backup copy held somewhere else is up to a day out of sync, that could be a whole day’s bookings, sales or analysis lost. If live systems and remote backups are out of sync by anything more than a few minutes, the time taken to restore live activity – and the disruption incurred in the meantime – could be significant. And of course data consumption is growing by the day. IDC predicts that by 2025 annual data generation will reach 16.1 zettabytes (trillion gigabytes) – 10 times the volume produced in 2016. So this is a situation that is only going to intensify.

Active, ongoing data replication protects organizations against downtime, because it ensures that there is always a true, current copy of the live data in a second location that can be swapped in at a moment’s notice.
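The gap between overnight copies and active replication can be made concrete with a back-of-the-envelope calculation. The sketch below is purely illustrative (it is not how WANdisco Fusion works internally, and the transaction rates and timings are hypothetical): it simply counts how many records written since the last completed sync would be exposed if the primary system failed mid-afternoon.

```python
# Illustrative sketch (hypothetical numbers, not WANdisco Fusion's mechanism):
# how much data is at risk when a primary fails, under a nightly batch copy
# versus near-continuous (active) replication.

def records_at_risk(failure_time_s, sync_interval_s, writes_per_s):
    """Records written since the last completed sync are lost on failover."""
    time_since_last_sync = failure_time_s % sync_interval_s
    return int(time_since_last_sync * writes_per_s)

WRITES_PER_SECOND = 200            # hypothetical transaction rate
FAILURE_AT = 18 * 3600 + 1813      # failure a little after 18:30 into the day

# Nightly batch copy: everything written since the overnight run is exposed.
nightly_loss = records_at_risk(FAILURE_AT, 24 * 3600, WRITES_PER_SECOND)

# Active replication: exposure shrinks to roughly the replication latency
# (assumed here to be a few seconds at most).
active_loss = records_at_risk(FAILURE_AT, 5, WRITES_PER_SECOND)

print(f"Nightly backup, records at risk:     {nightly_loss:,}")
print(f"Active replication, records at risk: {active_loss:,}")
```

Even with modest assumptions, the nightly-copy scenario leaves millions of records unrecoverable, while continuous replication reduces the exposure to whatever was in flight during the last few seconds.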


Defaulting to the cloud

Hybrid infrastructure scenarios – where organizations run some systems internally, but use the cloud for particular applications or processes – also depend on synchronicity. If on-premises systems and remotely-hosted applications share data, it had better be identical. Gartner predicts that by 2020, 90 percent of organizations will adopt hybrid infrastructure management capabilities. So, again, the importance of solving the continuous data synchronization issue will only grow more critical over time.

Active replication also paves the way for companies to ‘burst’ into the cloud – tapping into flexible, affordable additional data storage capacity and processing power to handle peak demand, special compute-intensive projects, or pop-up offices. As our reliance on Big Data continues to grow, we can bet that organizations will be doing this increasingly routinely. Analyst firm 451 Research notes that using an on-site private cloud environment combined with burst capacity to public clouds is often more economical and less disruptive than putting everything in the public cloud.

These sought-after scenarios just wouldn’t be viable without the assurance of completely consistent data between the dispersed IT locations - or not without a great deal of complexity and additional cost. So the Microsoft announcement is an important milestone for Azure HDInsight Service. 

It means organizations can do even more with their data – reliably, in the cloud. They’re covered by important controls too – for instance, over which subsets of content go where, satisfying data sovereignty, data protection and data availability requirements. 

Most importantly, data volumes no longer limit what users can do with their data: because it is continuously synchronized, companies can avoid the hassle and disruption of shipping physical storage between locations in order to mine it for new insights – an impractical workaround the data center industry has had to adopt as the world’s hunger for data soars.