Author: Jim Theodoras, ADVA Optical Networking

  • The growing importance of timing in data centers

    Editor’s note: This is the second of a two-part series on the importance of timing in today’s distributed infrastructures. The first ran on Saturday.

    Like a bad episode of Hoarders, people love to store all things digital, most of which will never be accessed again. And, like a bad episode of Storage Wars, our love of storing crap means we need more places to store it. Today’s content has outgrown even the hydroelectric-dam-powered mega data centers built just yesteryear. Increasingly, operators are turning to distributing their information across multiple geographically dispersed data centers. As the number, size, and distances between those data centers have steadily grown, timing distribution and accuracy have likewise grown in importance in keeping the data centers in sync.

    In a previous article I discussed new standards being developed to increase the accuracy of timing for the internet and other IP-based networks. Current systems and protocols offer milliseconds of accuracy. But that just isn’t enough as we depend more on real-time information and as compute, storage and communications networks become more distributed. While people often cite the importance of timing in mobile backhaul for next-generation LTE-Advanced networks, there has been less publicity around the need for these new timing technologies in the continued growth of data centers.

    The rise of Hadoop in an age of digital garbage

    Massive storage of data appears to occur in periods, very analogous to dinosaur evolution. A database architecture will rise to the forefront on the strength of its advantages, until it scales to the breaking point and is completely superseded by a new architecture. At first, databases were simply serially listed values in row/column arrangements. Database technology leapt forward and became a self-sufficient business with the advent of relational databases. For a while it appeared relational databases would be the last word in information storage, but then came Web 2.0, social media, and the cloud. Enter Hadoop.

    A centralized database works, as the name suggests, by having all the data located in a single indexed repository with massive computational power to run operations on it. But a centralized database cannot hope to scale to the size needed by today’s cloud apps. Even if it could, the time needed to perform a single lookup would be unbearable to an end user at a browser window.

    Hadoop de-centralizes the storage and lookup, as well as the computational power. There is no index, per se. Content is distributed across a wide array of servers, each with its own storage and CPUs, and the location and relation of each piece of data is recorded in a map. When a lookup occurs, the map is read, and all the pieces of information are fetched and pieced together again. The main benefit of Hadoop is scalability: to grow a database (and its computational power), you simply keep adding servers and growing your map.
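    The map-then-fetch flow above can be sketched in a few lines of Python. Everything here (the `Server` and `DataMap` classes, the round-robin placement) is illustrative of the idea, not actual Hadoop or HDFS API:

```python
# Toy sketch of Hadoop-style storage: content is split into blocks,
# the blocks are spread across servers, and a "map" records which
# server holds each block. Names are invented for illustration.

class Server:
    """One storage node holding a shard of the data."""
    def __init__(self):
        self.blocks = {}

    def put(self, block_id, payload):
        self.blocks[block_id] = payload

    def get(self, block_id):
        return self.blocks[block_id]

class DataMap:
    """Tracks where each block of a file lives (roughly the NameNode's role)."""
    def __init__(self, servers):
        self.servers = servers
        self.locations = {}  # block_id -> server index

    def store(self, file_id, payload, block_size=4):
        blocks = [payload[i:i + block_size] for i in range(0, len(payload), block_size)]
        for n, block in enumerate(blocks):
            block_id = (file_id, n)
            server = n % len(self.servers)  # simple round-robin placement
            self.servers[server].put(block_id, block)
            self.locations[block_id] = server

    def fetch(self, file_id):
        # Read the map, gather every block from its server, reassemble.
        ids = sorted(b for b in self.locations if b[0] == file_id)
        return "".join(self.servers[self.locations[b]].get(b) for b in ids)

cluster = DataMap([Server() for _ in range(3)])
cluster.store("photo1", "ABCDEFGHIJ")
print(cluster.fetch("photo1"))  # → ABCDEFGHIJ
```

    Growing the database really is just appending to `self.servers` and letting new blocks land on the new nodes, which is why scaling inside one building is Hadoop's strong suit.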

    Even Hadoop is buried under mounds of digital debris

    It looked like Hadoop would reign supreme for generations to come, with extensions continuously breathing new life into the protocol. Yet, after only a decade, Hadoop-based databases such as Facebook’s are at the breaking point. Global traffic growth is beyond exponential, and most of it is trash. Today’s databases look more like landfills than the great Jedi Archives. And recently hyped trends such as lifelogging suggest the problem will get much worse long before it gets better.

    The main limitation of Hadoop is that it works great within the walls of a single massive data center, but is less than stellar once a database outgrows those walls and has to be run across geographically separated data centers. It turns out the main strength of Hadoop is also its Achilles’ heel. With no index to search, every piece of data must be sorted through, a difficult proposition once a database stretches across the globe. A piece of retrieved data might be stale by the time it reaches the requester, or mirrored copies of data might conflict with one another.

    Enter an idea to keep widely dispersed data centers in sync — Google True Time. To grossly oversimplify the concept, the True Time API adds time attributes to the data being stored, not just for expiration dating, but also so that the content of all the geographically disparate data centers can be time-aligned. For database aficionados, this is sacrilegious, as all leading database protocols are specifically designed to ignore time to prevent conflicts and confusion. Google True Time turns the concept of data storage completely inside out.

    Introducing Spanner

    In True Time, knowing the accurate “age” of each piece of information, in other words where it falls on the timeline of data, allows data centers that may be 100ms apart to synchronize not just the values stored in memory locations, but the timeline of values in memory locations. In order for this to work, Google maintains an accurate “global wall-clock time” across their entire global Spanner network.

    Transactions that write are timestamped and use strict two-phase locking (S2PL) to manage access. The commit order is always the timestamp order. Both commit and timestamp orders respect global wall-clock time. This simple set of rules maintains coordination between databases all over the world.
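    One way to see why timestamp order can match commit order is the "commit wait" rule described in Google's Spanner paper: a write takes its timestamp at the top of its uncertainty interval, then waits that uncertainty out before becoming visible. A self-contained toy sketch (the 5ms epsilon and the function shape are assumptions, not Spanner's actual code):

```python
# Sketch of Spanner-style commit wait. A write is assigned a timestamp
# above the clock's uncertainty window, then blocks until that instant
# has certainly passed everywhere. Any later transaction, on any clock,
# must then receive a strictly larger timestamp.

import time

EPSILON = 0.005  # assumed clock uncertainty (5 ms) for this toy

def commit(write_log, value):
    now = time.time()
    ts = now + EPSILON  # top of the uncertainty interval
    # Commit wait: spin until ts is definitely in the past.
    while time.time() - EPSILON < ts:
        time.sleep(0.001)
    write_log.append((ts, value))  # visible only after the wait
    return ts

log = []
t1 = commit(log, "a")
t2 = commit(log, "b")
assert t1 < t2  # timestamp order matches commit order
```

    The price of this guarantee is that every write stalls for roughly twice epsilon, which is exactly why Google cares so much about driving epsilon down.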

    However, this introduces an element of uncertainty into each data field, the very reason that time has been shunned in database protocols since the dawn of data itself.


    Google calls this “network-induced uncertainty”, denotes it with an epsilon, and actively monitors and tracks the metric. As of summer 2012, it was running at 10ms for 99.9 percent (three nines) certainty. Google’s long-term goal is to reduce it below 1ms. Accomplishing this will require a state-of-the-art timing distribution network, leveraging the same technologies being developed and deployed for 4G LTE backhaul networks.

    A modest proposal

    While True Time was most likely developed to improve geographic load balancing, now that accurate time stamping of data exists, the possibilities are profound. The problems associated with large databases go beyond simply managing the data. The growth rate itself is unsustainable. Data storage providers must do more than grow their storage; they must also find ways to improve efficiency and stem the tsunami of waste that is common in the age of relatively free storage.

    It’s a dangerous notion, but one simply must challenge the basic tenet that all data is forever. Our minds don’t work that way; why should computers? We hold on only to key memories, and the further we get from an event, the fewer details we retain. Perhaps data storage could work similarly. Rather than deleting a picture that hasn’t been accessed in a while, a search is performed for similar photos and only one is kept. And as time passes, rather than simple deletion, a photo is continuously compressed, with less information kept, until the photo memory fades into oblivion. Like that old Polaroid hung on the refrigerator door.
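    The fading-Polaroid idea could be sketched as age-based downsampling. This is purely illustrative; the `fade` function and its stride rule are invented for this example, and no real storage system or image codec is modeled:

```python
# Toy "fading memory": the older a stored item gets without being
# accessed, the more aggressively it is thinned, until nothing remains.

def fade(pixels, age_in_years):
    """Keep every (2**age)-th sample; after enough years, keep nothing."""
    stride = 2 ** age_in_years
    if stride > len(pixels):
        return []  # the memory has faded into oblivion
    return pixels[::stride]

photo = list(range(16))          # stand-in for image data
print(len(fade(photo, 0)))       # → 16 (fresh, stored in full)
print(len(fade(photo, 2)))       # → 4  (details fading)
print(len(fade(photo, 5)))       # → 0  (gone)
```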

    Jim Theodoras is director of technical marketing at ADVA Optical Networking, working on Optical+Ethernet transport products.

    Dinosaur image courtesy of Flickr user Denise Chen.


  • Timing is not just for traders anymore, networks need it too

    The past few years have seen low-latency networking get a lot of attention, driven primarily by high-frequency traders looking for an edge for their algorithms. However, the importance of communication latency and timing accuracy in general isn’t new. From the dawn of homo sapiens, when cave people first scratched lunar cycles on their cave walls, to the birth of telecommunications, accurately knowing what time it is has been important — for people and for networks.

    Yet, in the move to packetized information, and the internet as we know it, timing got left behind. In a fatal mix of both enthusiasm and arrogance, synchronous timing was seen as irrelevant. After all, the world was moving to asynchronous packetized information switched by routers. Why would anyone still need old-fashioned synchronous information? Ma Bell was dead. And what did she know anyway? Fast forward to today, and the current standard Network Time Protocol offers timing only to within tens of milliseconds, and only to within two whole seconds in the Windows implementation!

    One only need look at the OPERA physics experiment in Gran Sasso to see the critical importance of timing. A single loose optical connector in their timing network produced a 75 nanosecond error, which led to global press coverage of their announcement that neutrinos travel faster than the speed of light. Timing will always be important, as all information is time-variant. There is no way to accurately know the what without knowing the when.

    The evolution of timing standards.

    With synchronous networking, you got the timing for free, as both the frequency and phase of the clock were buried in the carrier signal. Maintaining accurate timing and synchronization over a network that communicates with variable-length packets spaced randomly apart is much more challenging.

    So challenging, in fact, that network architects are taking a new look at the old approach: timing distribution networks. A throwback to analog phone calls and T1 internet service, the basic premise is that timing is once again embedded in the data being transported, with a clear protocol on how it may and may not be used. What might have been blasphemous to evangelical packet proponents at the start of the asynchronous packet age — making asynchronous networks more synchronous — is now seen as an urgent necessity.

    In a modern timing distribution network, there is still an atomic master clock that serves as the single reference point for the entire network. The challenge is in maintaining that accuracy as the clock is distributed across a transport network. There is nothing that can be done about the time of flight of the clock signal, as the speed of light is the speed of light (except in Gran Sasso).

    However, if that transport time is accurately known, an offset may be applied, and relative clock accuracy is maintained. GPS and other local clock references do not go away. Rather, they all interconnect and are carefully synchronized, with all available information used to statistically narrow the uncertainty of the exact time at each node in the network. A loss of any source is easily compensated by group knowledge. If a more severe timing outage occurs, all that happens is the standard deviation of existing clock sources may spread a little. Math to the rescue.
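    The statistical narrowing described above is, at heart, inverse-variance weighting: combine every available clock source, weighting each by how much you trust it, and the fused estimate is tighter than any single source. A minimal sketch with made-up numbers:

```python
# "Math to the rescue": fuse several clock-offset estimates, each with
# its own standard deviation, using inverse-variance weighting. The
# fused uncertainty is smaller than the best individual source, and
# losing one source only widens it gracefully.

def fuse(estimates):
    """estimates: list of (offset_seconds, std_dev). Returns (offset, std_dev)."""
    weights = [1.0 / (s * s) for _, s in estimates]
    total = sum(weights)
    offset = sum(w * o for w, (o, _) in zip(weights, estimates)) / total
    return offset, (1.0 / total) ** 0.5

sources = [(0.0021, 0.0010),   # GPS: 2.1 ms offset, 1 ms std dev (invented)
           (0.0019, 0.0020),   # PTP from a neighboring node
           (0.0025, 0.0050)]   # a coarser network reference
offset, sigma = fuse(sources)
assert sigma < min(s for _, s in sources)  # fused clock beats every single source
```

    Drop any one entry from `sources` and `fuse` still returns a sensible answer with a slightly larger sigma, which is the "group knowledge" compensation the text describes.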

    Meet the standards.

    There is a complex interplay of industry standards making all this happen. IEEE 1588v2 Precision Time Protocol (PTP) defines both how timing is embedded in packets and how each node should pass or modify that information. ITU-T G.8261/2/4 Synchronous Ethernet (SyncE) locks each node’s output frequency to the incoming physical-layer signal. For SyncE to work, all links in the chain must support it and be in a locked state; PTP is much more forgiving, as it only requires all nodes to be transparent to its packets, a much lower bar. When PTP and SyncE are combined, the ultimate in accuracy can be achieved.
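    The arithmetic behind PTP is worth seeing. From four timestamps (a Sync message sent at t1 and received at t2, a Delay_Req sent at t3 and received at t4), and assuming the forward and reverse paths are symmetric, a node can solve for both its clock offset and the one-way path delay. A sketch of just that arithmetic, not a protocol implementation:

```python
# The two-way time-transfer equations used by PTP (IEEE 1588):
#   offset = ((t2 - t1) - (t4 - t3)) / 2   (slave clock minus master clock)
#   delay  = ((t2 - t1) + (t4 - t3)) / 2   (one-way path delay)
# Both follow from assuming the path delay is the same in each direction.

def ptp_offset_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) - (t4 - t3)) / 2.0
    delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, delay

# Invented example: slave runs 3 ms fast, one-way delay is 1 ms.
offset, delay = ptp_offset_delay(t1=0.000, t2=0.004,   # 1 ms flight + 3 ms offset
                                 t3=0.010, t4=0.008)   # 1 ms flight - 3 ms offset
assert abs(offset - 0.003) < 1e-12
assert abs(delay - 0.001) < 1e-12
```

    Asymmetric paths break the assumption baked into these equations, which is one reason careful engineering of the transport network matters as much as the protocol itself.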

    While the packet timing standards and technology may be complex, implementation is surprisingly simple. Network devices that support timing are merely added at each node. Where existing customer premises equipment does not support timing, SyncProbes can be added either in series or in parallel to existing links. The timing protocols start working instantly in the background to improve all aspects of packet transport.

    Why it matters

    These recent advances in network timing have come none too soon. As mobile network operators make the transition to LTE Advanced, the required frequency and phase accuracy can only be achieved with timing distribution networks. LTE Advanced needs not only microsecond timing accuracy, but tight phase alignment as well. An often overlooked fact of LTE Advanced is the sheer number of antenna sites. Even though GPS clock sources continue to drop dramatically in price, it is simply not practical to place a GPS clock source at every site, nor would they be accurate enough without the additional timing information provided by the timing distribution network.

    However, the most important need for accurate timing is the one that goes unnoticed by even the most prognostic of soothsayers: Data centers. In the second of these articles on Sunday, we will look at the growing importance of timing in data centers.

    Jim Theodoras is director of technical marketing at ADVA Optical Networking, working on Optical+Ethernet transport products.
