A central cooling plant in Google’s Douglas County, Georgia, data center.
Photo: Google/Connie Zhou
If you’re looking for the beating heart of the digital age —
a physical location where the scope, grandeur, and geekiness of the
kingdom of bits become manifest—you could do a lot worse than Lenoir,
North Carolina. This rural city of 18,000 was once rife with furniture
factories. Now it’s the home of a Google data center.
Engineering prowess famously catapulted the 14-year-old search giant
into its place as one of the world’s most successful, influential, and
frighteningly powerful companies. Its constantly refined search
algorithm changed the way we all access and even think about
information. Its equally complex ad-auction platform is a perpetual
money-minting machine. But other, less well-known engineering and
strategic breakthroughs are arguably just as crucial to Google’s
success: its ability to build, organize, and operate a huge network of
servers and fiber-optic cables with an efficiency and speed that rocks
physics on its heels. Google has spread its infrastructure across a
global archipelago of massive buildings—a dozen or so information
palaces in locales as diverse as Council Bluffs, Iowa; St. Ghislain,
Belgium; and soon Hong Kong and Singapore—where an unspecified but huge
number of machines process and deliver the continuing chronicle of human
experience.
This is what makes Google Google: its physical network, its thousands
of fiber miles, and those many thousands of servers that, in aggregate,
add up to the mother of all clouds. This multibillion-dollar
infrastructure allows the company to index 20 billion web pages a day.
To handle more than 3 billion daily search queries. To conduct millions
of ad auctions in real time. To offer free email storage to 425 million
Gmail users. To zip millions of YouTube videos to users every day. To
deliver search results before the user has finished typing the query. In
the near future, when Google releases the wearable computing platform
called Glass, this infrastructure will power its visual search results.
The problem for would-be bards attempting to sing of these data
centers has been that, because Google sees its network as the ultimate
competitive advantage, only critical employees have been permitted even a
peek inside, a prohibition that has most certainly included bards.
Until now.
A server room in Council Bluffs, Iowa.
Photo: Google/Connie Zhou
Here I am, in a huge white building in Lenoir, standing near a
reinforced door with a party of Googlers, ready to become that rarest of
species: an outsider who has been inside one of the company’s data
centers and seen the legendary server floor, referred to simply as “the
floor.” My visit is the latest evidence that Google is relaxing its
black-box policy. My hosts include Joe Kava, who’s in charge of building
and maintaining Google’s data centers, and his colleague Vitaly
Gudanets, who populates the facilities with computers and makes sure
they run smoothly.
A sign outside the floor dictates that no one can enter without
hearing protection, either salmon-colored earplugs that dispensers spit
out like trail mix or panda-bear earmuffs like the ones worn by airline
ground crews. (The noise is a high-pitched thrum from fans that control
airflow.) We grab the plugs. Kava holds his hand up to a security
scanner and opens the heavy door. Then we slip into a thunderdome of
data …
Urs Hölzle had never stepped into a data center
before he was hired by Sergey Brin and Larry Page. A hirsute,
soft-spoken Swiss, Hölzle was on leave as a computer science professor
at UC Santa Barbara in February 1999 when his new employers took him to
the Exodus server facility in Santa Clara. Exodus was a colocation site,
or colo, where multiple companies rent floor space. Google’s “cage” sat
next to servers from eBay and other blue-chip Internet companies. But
the search company’s array was the most densely packed and chaotic. Brin
and Page were looking to upgrade the system, which often took a full
3.5 seconds to deliver search results and tended to crash on Mondays.
They brought Hölzle on to help drive the effort.
It wouldn’t be easy. Exodus was “a huge mess,” Hölzle later recalled.
And the cramped hodgepodge would soon be strained even more. Google was
not only processing millions of queries every week but also stepping up
the frequency with which it indexed the web, gathering every bit of
online information and putting it into a searchable format. AdWords—the
service that invited advertisers to bid for placement alongside search
results relevant to their wares—involved computation-heavy processes
that were just as demanding as search. Page had also become obsessed
with speed, with delivering search results so quickly that it gave the
illusion of mind reading, a trick that required even more servers and
connections. And the faster Google delivered results, the more popular
it became, creating an even greater burden. Meanwhile, the company was
adding other applications, including a mail service that would require
instant access to many petabytes of storage. Worse yet, the tech
downturn of the early 2000s, which had left many data centers
underpopulated, was ending, and Google’s future leasing deals would
become much more costly.
For Google to succeed, it would have to build and operate its own
data centers—and figure out how to do it more cheaply and efficiently
than anyone had before. The mission was codenamed Willpower. Its first
built-from-scratch data center was in The Dalles, a city in Oregon near
the Columbia River.
Hölzle and his team designed the $600 million facility in light of a
radical insight: Server rooms did not have to be kept so cold. The
machines throw off prodigious amounts of heat. Traditionally, data
centers cool them off with giant computer room air conditioners, or
CRACs, typically jammed under raised floors and cranked up to arctic
levels. That requires massive amounts of energy; data centers consume up
to 1.5 percent of all the electricity
in the world.
Google realized that the so-called cold aisle in front of the
machines could be kept at a relatively balmy 80 degrees or so—workers
could wear shorts and T-shirts instead of the standard sweaters. And the
“hot aisle,” a tightly enclosed space where the heat pours from the
rear of the servers, could be allowed to hit around 120 degrees. That
heat could be absorbed by coils filled with water, which would then be
pumped out of the building and cooled before being circulated back
inside. Add that to the long list of Google’s accomplishments: The
company broke its CRAC habit.
Google also figured out money-saving ways to cool that water. Many
data centers relied on energy-gobbling chillers, but Google’s big data
centers usually employ giant towers where the hot water trickles down
through the equivalent of vast radiators, some of it evaporating and the
remainder attaining room temperature or lower by the time it reaches
the bottom. In its Belgium facility, Google uses recycled industrial
canal water for the cooling; in Finland it uses seawater.
The company’s analysis of electrical flow unearthed another source of
waste: the bulky uninterruptible-power-supply systems that protected
servers from power disruptions in most data centers. Not only did they
leak electricity, they also required their own cooling systems. But
because Google designed the racks on which it placed its machines, it
could make space for backup batteries next to each server, doing away
with the big UPS units altogether. According to Joe Kava, that scheme
reduced electricity loss by about 15 percent.
All of these innovations helped Google achieve unprecedented energy
savings. The standard measurement of data center efficiency is called
power usage effectiveness, or PUE: the ratio of the total power a
facility draws to the power that actually reaches the computing
equipment. A perfect score is 1.0, meaning every watt goes to the
machines themselves. Experts considered 2.0—indicating that as much
power goes to cooling and other overhead as to computation—to be a
reasonable number for a data center. Google was getting an
unprecedented 1.2.
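To make that ratio concrete, here is a minimal back-of-the-envelope sketch in Python; the wattage figures are hypothetical, chosen only to contrast a conventional 2.0 facility with Google’s reported 1.2.

def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# A hypothetical facility whose servers draw 1,000 kW:
it_load = 1000.0

# A conventional data center spending another 1,000 kW on cooling and power
# distribution lands at PUE 2.0 -- as much overhead as computation.
print(pue(it_load + 1000.0, it_load))  # 2.0

# At Google's reported 1.2, the same IT load needs only about 200 kW of overhead.
print(pue(it_load + 200.0, it_load))   # 1.2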
For years Google didn’t share what it was up to. “Our core advantage
really was a massive computer network, more massive than probably anyone
else’s in the world,” says Jim Reese, who helped set up the company’s
servers. “We realized that it might not be in our best interest to let
our competitors know.”
But stealth had its drawbacks. Google was on record as being an
exemplar of green practices. In 2007 the company committed formally to
carbon neutrality, meaning that every molecule of carbon produced by its
activities—from operating its cooling units to running its diesel
generators—had to be canceled by offsets. Maintaining secrecy about
energy savings undercut that ideal: If competitors knew how much energy
Google was saving, they’d try to match those results, and that could
make a real environmental impact. Also, the stonewalling, particularly
regarding The Dalles facility, was becoming almost comical. Google’s
ownership had become a matter of public record, but the company still
refused to acknowledge it.
In 2009, at an event dubbed the Efficient Data Center Summit, Google
announced its latest PUE results and hinted at some of its techniques.
It marked a turning point for the industry, and now companies like
Facebook and Yahoo report similar PUEs.
Make no mistake, though: The green that motivates Google involves
presidential portraiture. “Of course we love to save energy,” Hölzle
says. “But take something like Gmail. We would lose a fair amount of
money on Gmail if we did our data centers and servers the conventional
way. Because of our efficiency, we can make the cost small enough that
we can give it away for free.”
Google’s breakthroughs extend well beyond energy.
Indeed, while Google is still thought of as an Internet company, it has
also grown into one of the world’s largest hardware manufacturers,
thanks to the fact that it builds much of its own equipment. In 1999,
Hölzle bought parts for 2,000 stripped-down “breadboards” from “three
guys who had an electronics shop.” By going homebrew and eliminating
unneeded components, Google built a batch of servers for about $1,500
apiece, instead of the then-standard $5,000. Hölzle, Page, and a third
engineer designed the rigs themselves. “It wasn’t really ‘designed,’”
Hölzle says, gesturing with air quotes.
More than a dozen generations of Google servers later, the company
now takes a much more sophisticated approach. Google knows exactly what
it needs inside its rigorously controlled data centers—speed, power, and
good connections—and saves money by not buying unnecessary extras. (No
graphics cards, for instance, since these machines never power a screen.
And no enclosures, because the motherboards go straight into the
racks.) The same principle applies to its networking equipment, some of
which Google began building a few years ago.
Outside the Council Bluffs data center, radiator-like cooling towers chill water from the server floor down to room temperature.
Photo: Google/Connie Zhou
So far, though, there’s one area where Google hasn’t ventured:
designing its own chips. But the company’s VP of platforms, Bart Sano,
implies that even that could change. “I’d never say never,” he says. “In
fact, I get that question every year. From Larry.”
Even if you reimagine the data center, the advantage won’t mean much
if you can’t get all those bits out to customers speedily and reliably.
And so Google has launched an attempt to wrap the world in fiber. In the
early 2000s, taking advantage of the failure of some telecom
operations, it began buying up abandoned fiber-optic networks, paying
pennies on the dollar. Now, through acquisition, swaps, and actually
laying down thousands of strands, the company has built a mighty empire
of glass.
But when you’ve got a property like YouTube, you’ve got to do even
more. It would be slow and burdensome to have millions of people
grabbing videos from Google’s few data centers. So Google installs its
own server racks in various outposts of its network—mini data centers,
sometimes connected directly to ISPs like Comcast or AT&T—and stuffs
them with popular videos. That means that if you stream, say, a Carly
Rae Jepsen video, you probably aren’t getting it from Lenoir or The
Dalles but from some colo just a few miles from where you are.
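The edge-caching idea behind those mini data centers can be sketched in a few lines of Python: keep the most popular videos on racks near viewers and fall back to a distant data center only on a miss. This is a toy LRU cache written for illustration, not Google’s actual software; the class name, capacity, and fetch function are all invented.

from collections import OrderedDict

class EdgeCache:
    """Toy LRU cache standing in for a mini data center near an ISP."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.videos = OrderedDict()  # video_id -> video bytes

    def get(self, video_id, fetch_from_origin):
        if video_id in self.videos:
            # Cache hit: serve locally and mark as recently used.
            self.videos.move_to_end(video_id)
            return self.videos[video_id]
        # Cache miss: pull from a faraway data center, then keep a copy
        # for the next nearby viewer.
        data = fetch_from_origin(video_id)
        self.videos[video_id] = data
        if len(self.videos) > self.capacity:
            self.videos.popitem(last=False)  # evict the least recently used
        return data

# The second request for a popular video never leaves the edge.
cache = EdgeCache(capacity=10000)
origin = lambda vid: "<bytes of %s>" % vid
cache.get("popular-video", origin)  # miss: fetched from the origin
cache.get("popular-video", origin)  # hit: served from the nearby colo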
Over the years, Google has also built a software system that allows
it to manage its countless servers as if they were one giant entity. Its
in-house developers can act like puppet masters, dispatching thousands
of computers to perform tasks as easily as running a single machine. In
2002 its scientists created Google File System, which smoothly
distributes files across many machines. MapReduce, a Google framework
for processing huge data sets across many machines, was so successful
that an open source version called Hadoop has become an industry
standard. Google also
created software to tackle a knotty issue facing all huge data
operations: When tasks come pouring into the center, how do you
determine instantly and most efficiently which machines can best afford
to take on the work? Google has solved this “load-balancing” issue with
an automated system called Borg.
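As a rough illustration of the MapReduce model mentioned above, here is a minimal single-machine sketch of the classic word-count job in Python: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step combines each group. The real framework (and Hadoop) spreads these phases across thousands of machines; this code only mimics the shape of the programming model.

from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (word, 1) for every word in one input document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["the data center as a computer", "the warehouse as a computer"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # e.g. {'the': 2, 'computer': 2, 'data': 1, ...}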
These innovations allow Google to fulfill an idea embodied in a 2009
paper written by Hölzle and one of his top lieutenants, computer
scientist Luiz Barroso: “The computing platform of interest no longer
resembles a pizza box or a refrigerator but a warehouse full of
computers … We must treat the data center itself as one massive
warehouse-scale computer.”
This is tremendously empowering for the people who write Google code.
Just as your computer is a single device that runs different programs
simultaneously—and you don’t have to worry about which part is running
which application—Google engineers can treat seas of servers like a
single unit. They just write their production code, and the system
distributes it across a server floor they will likely never be
authorized to visit. “If you’re an average engineer here, you can be
completely oblivious,” Hölzle says. “You can order x petabytes of
storage or whatever, and you have no idea what actually happens.”
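In that spirit, asking for capacity looks less like picking machines and more like declaring requirements and letting the scheduler find a home for the work. The snippet below is purely a hypothetical sketch of that declarative style; the field names and the submit function are invented for illustration and are not Google’s actual interface.

# Hypothetical job description: the engineer states what the service needs,
# not which machines (or which data center) will run it.
job = {
    "name": "mail-frontend",
    "replicas": 2000,           # copies of the task to run somewhere
    "cpu_per_task": 0.5,        # cores
    "ram_per_task_gb": 4,
    "storage_petabytes": 3,     # "x petabytes or whatever"
    "priority": "production",
}

def submit(job):
    """Stand-in for a cluster scheduler: placement is the system's problem."""
    print("Scheduling %d replicas of %s across the warehouse-scale computer"
          % (job["replicas"], job["name"]))

submit(job)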
But of course, none of this infrastructure is any good if it isn’t
reliable. Google has innovated its own answer for that problem as
well—one that involves a surprising ingredient for a company built on
algorithms and automation: people.
At 3 am on a chilly winter morning, a small cadre of
engineers begin to attack Google. First they take down the internal
corporate network that serves the company’s Mountain View, California,
campus. Later the team attempts to disrupt various Google data centers
by causing leaks in the water pipes and staging protests outside the
gates—in hopes of distracting attention from intruders who try to steal
data-packed disks from the servers. They mess with various services,
including the company’s ad network. They take a data center in the
Netherlands offline. Then comes the coup de grâce—cutting most of
Google’s fiber connection to Asia.
Turns out this is an inside job. The attackers, working from a
conference room on the fringes of the campus, are actually Googlers,
part of the company’s Site Reliability Engineering team, the people with
ultimate responsibility for keeping Google and its services running.
SREs are not merely troubleshooters but engineers who are also in charge
of getting production code onto the “bare metal” of the servers; many
are embedded in product groups for services like Gmail or search. Upon
becoming an SRE, members of this geek SEAL team are presented with
leather jackets bearing a military-style insignia patch. Every year, the
SREs run this simulated war—called DiRT (disaster recovery testing)—on
Google’s infrastructure. The attack may be fake, but it’s almost
indistinguishable from reality: Incident managers must go through
response procedures as if they were really happening. In some cases,
actual functioning services are messed with. If the teams in charge
can’t figure out fixes and patches to keep things running, the attacks
must be aborted so real users won’t be affected. In classic Google
fashion, the DiRT team always adds a goofy element to its dead-serious
test—a loony narrative written by a member of the attack team. This year
it involves a
Twin Peaks-style supernatural phenomenon that supposedly caused the disturbances. Previous DiRTs were attributed to zombies or aliens.
Some halls in Google’s Hamina, Finland, data center remain vacant—for now.
Photo: Google/Connie Zhou
As the first attack begins, Kripa Krishnan, an upbeat engineer who
heads the annual exercise, explains the rules to about 20 SREs in a
conference room already littered with junk food. “Do not attempt to fix
anything,” she says. “As far as the people on the job are concerned, we
do not exist. If we’re really lucky, we won’t break anything.” Then she
pulls the plug—for real—on the campus network. The team monitors the
phone lines and IRC channels to see when the Google incident managers on
call around the world notice that something is wrong. It takes only
five minutes for someone in Europe to discover the problem, and he
immediately begins contacting others.
“My role is to come up with big tests that really expose weaknesses,”
Krishnan says. “Over the years, we’ve also become braver in how much
we’re willing to disrupt in order to make sure everything works.” How
did Google do this time? Pretty well. Despite the outages in the
corporate network, executive chair Eric Schmidt was able to run a
scheduled global all-hands meeting. The imaginary demonstrators were
placated by imaginary pizza. Even shutting down three-fourths of
Google’s Asia traffic capacity didn’t shut out the continent, thanks to
extensive caching. “This is the best DiRT ever!” Krishnan exclaimed at
one point.
The SRE program began when Hölzle charged an engineer named Ben
Treynor with making Google’s network fail-safe. This was especially
tricky for a massive company like Google that is constantly tweaking its
systems and services—after all, the easiest way to stabilize it would
be to freeze all change. Treynor ended up rethinking the very concept of
reliability. Instead of trying to build a system that never failed, he
gave each service a budget—an amount of downtime it was permitted to
have. Then he made sure that Google’s engineers used that time
productively. “Let’s say we wanted Google+ to run 99.95 percent of the
time,” Hölzle says. “We want to make sure we don’t get that downtime for
stupid reasons, like we weren’t paying attention. We want that downtime
because we push something new.”
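The 99.95 percent figure Hölzle cites translates into a concrete downtime budget. A quick calculation (the availability target is the only number taken from the text) shows how much room that leaves for deliberately risky pushes.

def downtime_budget_minutes(availability, period_days=30):
    """Minutes of allowed downtime for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability)

# A 99.95 percent target leaves roughly 21.6 minutes per 30-day month,
# downtime the team wants to "spend" on pushing something new, not on neglect.
print(round(downtime_budget_minutes(0.9995), 1))       # 21.6
print(round(downtime_budget_minutes(0.9995, 365), 1))  # 262.8 minutes per year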
Nevertheless, accidents do happen—as Sabrina Farmer learned on the
morning of April 17, 2012. Farmer, who had been the lead SRE on the
Gmail team for a little over a year, was attending a routine design
review session. Suddenly an engineer burst into the room, blurting out,
“Something big is happening!” Indeed: For 1.4 percent of users (with
some 425 million Gmail accounts, roughly 6 million people), Gmail was
down. Soon reports of the outage were all
over Twitter and tech sites. They were even bleeding into mainstream
news.
The conference room transformed into a war room. Collaborating with a
peer group in Zurich, Farmer launched a forensic investigation. A
breakthrough came when one of her Gmail SREs sheepishly admitted, “I
pushed a change on Friday that might have affected this.” Those
responsible for vetting the change hadn’t been meticulous, and when some
Gmail users tried to access their mail, various replicas of their data
across the system were no longer in sync. To keep the data safe, the
system froze them out.
The diagnosis had taken 20 minutes, designing the fix 25 minutes
more—pretty good. But the event went down as a Google blunder. “It’s
pretty painful when SREs trigger a response,” Farmer says. “But I’m
happy no one lost data.” Nonetheless, she’ll be happier if her future
crises are limited to DiRT-borne zombie attacks.
One scenario that DiRT never envisioned was the
presence of a reporter on a server floor. But here I am in Lenoir,
earplugs in place, with Joe Kava motioning me inside.
We have passed through the heavy gate outside the facility, with
remote-control barriers evoking the Korean DMZ. We have walked through
the business offices, decked out in Nascar regalia. (Every Google data
center has a decorative theme.) We have toured the control room, where
LCD dashboards monitor every conceivable metric. Later we will climb up
to catwalks to examine the giant cooling towers and backup electric
generators, which look like Beatle-esque submarines, only green. We will
don hard hats and tour the construction site of a second data center
just up the hill. And we will stare at a rugged chunk of land that one
day will hold a third mammoth computational facility.
But now we enter the floor.
Big doesn’t begin to describe
it. Row after row of server racks seem to stretch to eternity. Joe
Montana in his prime could not throw a football the length of it.
During my interviews with Googlers, the idea of hot aisles and cold
aisles has been an abstraction, but on the floor everything becomes
clear. The cold aisle refers to the general room temperature—which Kava
confirms is 77 degrees. The hot aisle is the narrow space between the
backsides of two rows of servers, tightly enclosed by sheet metal on the
ends. A nest of copper coils absorbs the heat. Above are huge fans,
which sound like jet engines jacked through Marshall amps.
We walk between the server rows. All the cables and plugs are in
front, so no one has to crack open the sheet metal and venture into the
hot aisle, thereby becoming barbecue meat. (When someone does have to
head back there, the servers are shut down.) Every server has a sticker
with a code that identifies its exact address, useful if something goes
wrong. The servers have thick black batteries alongside. Everything is
uniform and in place—nothing like the spaghetti tangles of Google’s
long-ago Exodus era.
Blue lights twinkle, indicating … what? A web search? Someone’s Gmail
message? A Glass calendar event floating in front of Sergey’s eyeball?
It could be anything.
Every so often a worker appears—a long-haired dude in shorts
propelling himself by scooter, or a woman in a T-shirt who’s pushing a
cart with a laptop on top and dispensing repair parts to servers like a
psychiatric nurse handing out meds. (In fact, the area on the floor that
holds the replacement gear is called the pharmacy.)
How many servers does Google employ? It’s a question that has dogged
observers since the company built its first data center. It has long
stuck to “hundreds of thousands.” (There are 49,923 operating in the
Lenoir facility on the day of my visit.) I will later come across a clue
when I get a peek inside Google’s data center R&D facility in
Mountain View. In a secure area, there’s a row of motherboards fixed to
the wall, an honor roll of generations of Google’s homebrewed servers.
One sits atop a tiny embossed plaque that reads
“July 9, 2008. Google’s millionth server.”
But executives explain that this is a cumulative number, not
necessarily an indication that Google has a million servers in operation
at once.
Wandering the cold aisles of Lenoir, I realize that the magic number,
if it is even obtainable, is basically meaningless. Today’s machines,
with multicore processors and other advances, have many times the power
and utility of earlier versions. A single Google server circa 2012 may
be the equivalent of 20 servers from a previous generation. In any case,
Google thinks in terms of clusters—huge numbers of machines that act
together to provide a service or run an application. “An individual
server means nothing,” Hölzle says. “We track computer power as an
abstract metric.” It’s the realization of a concept Hölzle and Barroso
spelled out three years ago: the data center as a computer.
As we leave the floor, I feel almost levitated by my peek inside
Google’s inner sanctum. But a few weeks later, back at the Googleplex in
Mountain View, I realize that my epiphanies have limited shelf life.
Google’s intention is to render the data center I visited obsolete.
“Once our people get used to our 2013 buildings and clusters,” Hölzle
says, “they’re going to complain about the current ones.”
Asked in what areas one might expect change, Hölzle mentions data
center and cluster design, speed of deployment, and flexibility. Then he
stops short. “This is one thing I can’t talk about,” he says, a smile
cracking his bearded visage, “because we’ve spent our own blood, sweat,
and tears. I want others to spend their own blood, sweat, and tears
making the same discoveries.” Google may be dedicated to providing
access to all the world’s data, but some information it’s still keeping
to itself.
Senior writer Steven Levy (steven_levy@wired.com) interviewed Mary Meeker in issue 20.10.