Coverage

1 AUGUST 2017 Chris Mellor, The Register

WANdisco sticks Fusion into Amazon's Snowballs for mega-petabyte data pelt

WANdisco is integrating its Fusion product with Amazon's Snowball product, which moves massive amounts of data to its public cloud.

Replication tech integrated with data truck - yes, an actual truck...


Snowball is the AWS method of transporting large amounts of data to its public cloud; data amounts so large that digital transmission across a wide-area network (WAN) would take weeks or more and cost a fortune. Data is transferred to drives and these are transported to an Amazon data centre, where their contents are read and uploaded to the AWS cloud's storage arrays. Vast datasets, up to 45PB, are transported by a truck, a so-called Snowmobile.

WANdisco (Wide-Area Network Distributed Computing) Fusion is replication technology that can handle transactional data and transmit it from multiple sources to a destination while the data set at the sources is still in use.

Essentially, what happens is that distributed Paxos algorithm technology, devised by chief scientist Dr Yeturu Aahlad, is used by the several processors involved to register and agree on the order of updates to the global data set. These updates are given a Global Sequence Number (GSN) and that enables them to be applied in sequence (replayed) at the target data centre.
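The GSN idea can be sketched in a few lines of Python. This is a hypothetical illustration only: in Fusion the ordering is agreed between processors via the distributed Paxos algorithm, whereas here a single counter stands in for that consensus step, and the names (`Coordinator`, `replay`) are invented for the example.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Coordinator:
    """Stand-in for the consensus step that assigns Global Sequence Numbers."""
    _gsn: count = field(default_factory=lambda: count(1))

    def assign(self, update):
        # Attach the agreed sequence number to the update.
        return (next(self._gsn), update)

def replay(target, sequenced_updates):
    # Apply updates at the target strictly in GSN order,
    # regardless of the order in which they arrived.
    for gsn, update in sorted(sequenced_updates):
        target.append(update)

coord = Coordinator()
events = [coord.assign(u) for u in ["write A", "write B", "write C"]]

target = []
replay(target, reversed(events))  # arrival order doesn't matter
print(target)  # -> ['write A', 'write B', 'write C']
```

Because each update carries its GSN, the target site can reconstruct the agreed global order even when updates arrive late or out of sequence.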

The system can withstand network outages by saving up the registered data events and then having GSNs calculated and the data sent upstream when the network is back up again.

An AWS Snowmobile data transfer can be viewed as a network outage, a fairly prolonged one. With Fusion technology installed at both the Snowmobile source site and the AWS destination site, the Snowmobile data can be uploaded to Amazon, a normal Internet access network pipe to the dataset established, and the Fusion technology then used to "replay" the intervening updates from the source site against the AWS-held dataset. This ensures eventual consistency between the source and AWS target datasets.
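The save-up-and-replay behaviour described above can be sketched as follows. This is a toy model, not Fusion's implementation: the class name, the in-memory backlog, and the `reconnect` method are all assumptions made for illustration; the outage stands in for the weeks a Snowmobile spends in transit.

```python
class ReplicatedStore:
    """Toy model of a source site replicating to a remote target."""

    def __init__(self):
        self.online = True
        self.backlog = []   # (gsn, update) pairs saved while the link is down
        self.target = []    # stands in for the AWS-held dataset

    def submit(self, gsn, update):
        if self.online:
            self.target.append(update)
        else:
            # Network outage (or truck in transit): save the event locally.
            self.backlog.append((gsn, update))

    def reconnect(self):
        # Link restored: replay the saved events in GSN order,
        # then resume live replication.
        for gsn, update in sorted(self.backlog):
            self.target.append(update)
        self.backlog.clear()
        self.online = True

store = ReplicatedStore()
store.submit(1, "update 1")
store.online = False          # link goes down
store.submit(3, "update 3")   # registered out of order during the outage
store.submit(2, "update 2")
store.reconnect()             # backlog replayed in sequence
print(store.target)  # -> ['update 1', 'update 2', 'update 3']
```

The key property is that the target ends up with the same agreed update order it would have seen had the link never gone down.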


Why does this matter? If there are two or more updates to a dataset during a network outage then it may not matter if the updates are to different dataset items, aka database records. But if they are to the same record then they need to be applied in sequence, otherwise a disaster might happen.

Suppose the database record is a business' bank balance and it is $1,000,000. Update one is a deposit of $2,000,000 while update two is a withdrawal of $2,000,000. If they are applied in the wrong sequence then the business could find itself with an interim negative balance of -$1,000,000, with the bank doing bad things like suspending the account.
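The hazard is easy to demonstrate with the article's own figures. In this illustrative sketch (the function names are invented, and the GSN tags follow the scheme described earlier), applying the two updates in GSN order keeps the balance non-negative, while applying them as received does not.

```python
def apply_in_order(balance, updates):
    """Apply (gsn, amount) updates in GSN order; return the balance history."""
    history = [balance]
    for _, amount in sorted(updates):  # sort by GSN
        balance += amount
        history.append(balance)
    return history

def apply_as_received(balance, updates):
    """Apply updates in arrival order, ignoring the GSN."""
    history = [balance]
    for _, amount in updates:
        balance += amount
        history.append(balance)
    return history

deposit    = (1, +2_000_000)   # GSN 1: the deposit happened first
withdrawal = (2, -2_000_000)   # GSN 2: the withdrawal came second

# Updates arrive withdrawal-first, e.g. after an outage.
arrived = [withdrawal, deposit]

print(apply_in_order(1_000_000, arrived))
# -> [1000000, 3000000, 1000000]  (never negative)

print(apply_as_received(1_000_000, arrived))
# -> [1000000, -1000000, 1000000]  (dips to -$1,000,000)
```

Either ordering ends at the same final balance; the damage is done by the interim negative state, which is exactly what sequence-number replay prevents.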

Guaranteed dataset consistency is a really big deal when you absolutely must have consistency. We understand that WANdisco and Amazon are talking to banking institutions, interested in moving data to the cloud, about this technology integration. It will become a core part of WANdisco Fusion and not a separately branded and charged-for item. ®