https://www.slideshare.net/DevopsCon/monitoring-at-facebook-ran-leibman-facebook-devopsdays-tel-aviv-2015

Facebook redesigned the data center network: 3 reasons it matters

Named the Facebook Auto-Remediation system, or FBAR, Power’s creation is a system of scripts, APIs and plugins that work together to get failed servers back online. At a high level, FBAR works by constantly scanning Facebook’s monitoring system for new outages, then undertaking a workflow to fix the problem. Because FBAR has access to hardware and configuration data, as well as the ability to execute commands on host servers, it’s able to solve some issues by itself. Others, such as a failed hard drive, are marked for human resolution.
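The scan-then-fix behaviour described above can be sketched roughly as follows. This is only an illustration, not FBAR's actual code: the `Alert` type, the problem labels, the `AUTO_FIXES` table and the `restart_service` helper are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    host: str
    problem: str  # hypothetical labels, e.g. "service_down", "disk_failed"


def restart_service(alert: Alert) -> bool:
    # Stand-in for executing a command on the host; always "succeeds" here.
    return True


# Problems this sketch can fix on its own. Anything not listed here
# (a failed hard drive, for instance) is handed to a human.
AUTO_FIXES: Dict[str, Callable[[Alert], bool]] = {
    "service_down": restart_service,
}


def process_alerts(alerts: List[Alert]) -> Dict[str, List[str]]:
    """Sort one batch of alerts into automatically fixed vs. needs a human."""
    outcome: Dict[str, List[str]] = {"fixed": [], "needs_human": []}
    for alert in alerts:
        fix = AUTO_FIXES.get(alert.problem)
        if fix is not None and fix(alert):
            outcome["fixed"].append(alert.host)
        else:
            outcome["needs_human"].append(alert.host)
    return outcome
```

In a real system, `process_alerts` would be fed continuously from the monitoring system rather than called once with a fixed list.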

The idea of self-healing systems is nothing new, of course (Google and other web properties do it to some degree, and IBM has been pushing Autonomic Computing for years), but it’s interesting to see new approaches to the problem. Additionally, it’s fascinating to see how a well-designed system can eliminate the need for huge IT departments. As Power notes:

Today, the FBAR service is developed and maintained by two full time engineers, but according to the most recent metrics, it’s doing the work of approximately 200 full time system administrators. FBAR now manages more than 50% of the Facebook infrastructure and we’ve found that services have dramatic increases in reliability when they go under FBAR control.

One of the key processes the company has automated is diagnosing and remediating server issues. Its home-baked FBAR system (short for Facebook Auto-Remediation) takes a problem through three stages of remediation, and a ticket gets generated for a human tech only if the problem isn’t solved automatically.
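A minimal sketch of that escalation pattern, assuming each stage is a function that reports whether it fixed the host. The source does not say what the three stages are, so the stage names in the usage note are placeholders, and `run_stages` and `remediate_or_ticket` are hypothetical helpers, not FBAR APIs.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, attempt) pair; attempt(host) returns True if it fixed the host.
Stage = Tuple[str, Callable[[str], bool]]


def run_stages(host: str, stages: List[Stage]) -> Optional[str]:
    """Try each stage in order; return the name of the first one that works."""
    for name, attempt in stages:
        if attempt(host):
            return name
    return None


def remediate_or_ticket(host: str, stages: List[Stage], tickets: List[str]) -> str:
    """Escalate through the stages; file a ticket only if every stage fails."""
    fixed_by = run_stages(host, stages)
    if fixed_by is None:
        tickets.append(host)  # stand-in for opening a ticket for a human tech
        return "ticket"
    return fixed_by
```

For example, with three placeholder stages `[("stage_1", f1), ("stage_2", f2), ("stage_3", f3)]`, a host that only `f3` can fix returns `"stage_3"`, and a host that none can fix is appended to `tickets`.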

The company’s data center team culture is similar. People who end up getting hired for Facebook data center jobs are usually flexible and open minded, comfortable in an environment with quickly shifting priorities. “We really are looking for people who are comfortable with moving fast,” Eberly says, “people who can pivot and make a change quickly.”

it’s running multiple data centers across the globe that store data and handle traffic for a large percentage of the world’s population.

One of Facebook’s latest engineering mantras, Patchett said, is about “scaling the wall [of users].” A major way the company is attempting to do this is through disaggregation, an effort that started with the Facebook application itself, spread to server design, and is now making its way to the data center. One example is building smaller data centers around the world, each optimized for specific workloads or for the applications that are popular in a given geography.

the importance of teams within Facebook working under shared assumptions of what they’re building
