Google pioneered Site Reliability Engineering before the DevOps (Development and Operations) movement gained popularity. That is, before organizations started combining software development and IT operations, and before they started to encourage better communication between those teams. So it is safe to say that Site Reliability Engineering originated at Google.
In 2003, the company charged a group of software engineers with the responsibility of making its site more reliable, scalable, and efficient. The practices the team developed worked so well for Google that other large tech companies, such as Amazon and Netflix, started to adopt them. Today, Site Reliability Engineering is an established IT domain.
What is Site Reliability Engineering?
According to Ben Treynor, the VP of engineering at Google and founder of Google Site Reliability Engineering, it is “what happens when you ask a software engineer to design an operations function.” Site Reliability Engineering creates a bridge between development and operations by applying a software engineering mindset to system administration topics.
Site Reliability Engineering is dedicated to creating software that improves the reliability of systems in production, fixing issues, responding to incidents and usually taking on-call responsibilities.
Responsibilities of a Site Reliability Engineer
Software Building: Site Reliability Engineers and their teams oversee building and implementing services that make IT and support teams better at their jobs. This can range from adjustments to monitoring and alerting to code changes in production. As a Site Reliability Engineer, you might be tasked with building a homegrown tool from scratch to address weaknesses in software delivery or incident management.
Solving Issues: As with the previous point, a Site Reliability Engineer can expect to spend time fixing support escalation issues. As your Site Reliability Engineering operations mature, you can expect your systems to become more reliable. This means you will see fewer critical incidents in production and fewer support escalations. Because a team of Site Reliability Engineers interacts with many different parts of the engineering and IT organization, they can provide insightful information and help route issues to the people and teams that can solve them.
On-call Processes: Site Reliability Engineers usually need to take on-call responsibilities. As a Site Reliability Engineer, you will have a lot of say in how your team can optimize system reliability by improving on-call processes.
Documentation: Site Reliability Engineers and their teams gain exposure to systems in both staging and production, as well as all technical teams. This means that they take part in most of the work and build historical knowledge over time. Rather than keeping this knowledge to themselves, they can document what they know so that other teams can access the information when they need it.
Post-incident Evaluation: As a Site Reliability Engineer, you need a way to identify and determine what works and what does not. This is the reason you need to conduct evaluations after an incident. These “checks” will improve the reliability of your service.
Skills of a Site Reliability Engineer
- Software-Centric Mindset: As a Site Reliability Engineer, you must understand that spending most of your time with the development teams does not mean you are aligned exclusively with them. You must be ready to support both the development and operations teams when necessary. Your primary focus is the reliability and performance of processes, so you must understand how to improve the performance and reliability of those processes.
- Adaptability: To stay relevant, businesses must release new features for their applications easily and frequently. As a bridge between Development and Operations, Site Reliability Engineers are required to fix application issues in production, making sure that minor errors do not grow into major business problems.
Other Skills of a Site Reliability Engineer
- Communication Skills: Site Reliability Engineers must know how to translate technical information into business language. As a Site Reliability Engineer, you must be able to hold business-centric conversations using the relevant terminology. You must be ready to translate business requirements into technical implementations and communicate the benefits that improved reliability can deliver.
- Automation: As a Site Reliability Engineer, you must understand that software automation is faster and more efficient than manual intervention when it comes to the maintenance and deployment of applications. You should also employ automation tools such as Chef or Puppet, which help automate repetitive tasks and create a more effective workflow. Site Reliability Engineers use data analytics, algorithms, and runbook automation to resolve problems and to detect and respond to issues automatically, faster than any human operations team can act.
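To make the idea of runbook automation more concrete, here is a minimal Python sketch of a "detect and respond" loop. Everything in it is hypothetical and chosen for illustration only: the metric names, the thresholds, and the `clean_temp_files` and `restart_service` placeholder actions are assumptions, not part of any real monitoring tool or of Chef or Puppet.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Metric:
    """A single reading from a (hypothetical) monitoring system."""
    service: str
    name: str
    value: float


# Assumed thresholds: a reading above the limit counts as an issue.
THRESHOLDS: Dict[str, float] = {
    "disk_used_percent": 90.0,
    "error_rate_percent": 5.0,
}


def clean_temp_files(service: str) -> str:
    # Placeholder for a real remediation step (for example, deleting old logs).
    return f"cleaned temp files on {service}"


def restart_service(service: str) -> str:
    # Placeholder for a real remediation step (for example, a rolling restart).
    return f"restarted {service}"


# The "runbook": which automated response handles which kind of issue.
RUNBOOK: Dict[str, Callable[[str], str]] = {
    "disk_used_percent": clean_temp_files,
    "error_rate_percent": restart_service,
}


def detect_and_respond(metrics: List[Metric]) -> List[str]:
    """Run the runbook action for every metric that exceeds its threshold."""
    actions = []
    for m in metrics:
        limit = THRESHOLDS.get(m.name)
        # Metrics without a known threshold are ignored.
        if limit is not None and m.value > limit:
            actions.append(RUNBOOK[m.name](m.service))
    return actions


if __name__ == "__main__":
    readings = [
        Metric("checkout", "disk_used_percent", 95.0),
        Metric("search", "error_rate_percent", 1.2),
        Metric("payments", "error_rate_percent", 7.5),
    ]
    for action in detect_and_respond(readings):
        print(action)
```

In practice the placeholder functions would call real tooling, and an issue with no automated response would typically be escalated to the on-call engineer instead of being ignored.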
Educational Requirements for a Site Reliability Engineer
To pursue a career as a Site Reliability Engineer, you will need to complete a bachelor’s degree program. Most employers prefer individuals with a degree in Computer Science. This means your pre-university education should focus on computing and related subjects.
Below are some high ranking schools where you can study Computer Science.
- The University of Lincoln in the United Kingdom offers a BSc (Hons) Computer Science. The degree allows students to develop the experience, skills, and knowledge to design and develop a variety of software and hardware computing solutions for real-world problems. The classes focus on artificial intelligence and machine learning, in addition to core computer science disciplines. The program will help students develop the mathematical, analytical, and problem-solving skills required to succeed in a dynamic and exciting modern computing industry. The program can be taken full-time (3-4 years) or part-time (6 years). The tuition for international students is £15,900.
- Kent State University in the United States of America offers a Bachelor of Science degree in Computer Science. The program is designed to teach students how to understand, design and build complex computer software systems. You will understand how the underlying operating and networking systems interact with local and remote information sources to solve complex problems. The tuition for international students is $21,400.
- Middlesex University in Mauritius offers a BSc Computer Science (Systems Engineering). The program is designed to meet the requirements of a range of ICT companies by equipping students with knowledge and skills drawn from both Computer Science and Systems Engineering, so that they can build on their technical and critical thinking abilities within a technology-rich environment. The tuition for international students is MUR 240,000.
Job Outlook and Salary
The U.S. Bureau of Labor Statistics (BLS) does not give specific data for Site Reliability Engineers. However, it is reasonable to classify them within the BLS listing for Computer Network Architects. This is because Computer Network Architects work on design plans for accessing clouds and other computer systems. They also review systems so that they can predict how the needs of those systems will change in the future, which is comparable to some parts of the work Site Reliability Engineers do. The BLS projected employment of Computer Network Architects to grow 9% from 2014 to 2024.
The average salary of a Site Reliability Engineer in the United States is $131,327 a year. (ziprecruiter.com)
The average salary for a Site Reliability Engineer in the United Kingdom is £73,404. (glassdoor.com)
In Canada, the average salary of a Site Reliability Engineer is $130,000 per year or $66.67 per hour. (nuevoo.ca)
Typical Employers of a Site Reliability Engineer
IT firms are the largest employers of Site Reliability Engineers. Here are examples of companies that hire them:
- Facebook (where Site Reliability Engineers are called Production Engineers)
- Reddit (you will be called a DevOps Software Engineer)
Careers Related to Site Reliability Engineering
Famous Site Reliability Engineers
- Julia Evans (Twitter and Blog): Julia is a professional software developer. She believes that a deep understanding of the fundamental systems you use (the browser, the kernel, the operating system, the network layers, your database, HTTP) is essential if you want to do technical and innovative work and to solve hard problems.
- Brendan Gregg (on Twitter): He is a Kernel and Performance Engineer at Netflix. Brendan is known for his work in systems performance analysis. He has previously worked at Sun Microsystems, Oracle Corporation, and Joyent. Brendan attended the University of Newcastle, Australia.
- Tammy Butow: Tammy Butow is a Principal SRE at Gremlin. She works on Chaos Engineering, which involves using controlled experiments to identify systemic weaknesses. She previously worked at DigitalOcean and at one of Australia’s largest banks in Security Engineering, Product Engineering and Infrastructure Engineering.
- Philip Fisher-Ogden: He is an engineering leader. Philip is currently a Director of Engineering for Product Edge Systems at Netflix. Before Netflix, he worked as a Teach for America teacher. He is proficient in Java, Tomcat, Apache, Linux, Amazon Web Services (EC2, ELBs, autoscaling), NoSQL (Cassandra, ElasticSearch), and many others.
Postgraduate Options for Site Reliability Engineers
MSc Reliability Engineering and Asset Management from the University of Manchester in the United Kingdom. The program teaches management techniques, organisation, planning and the application of substantial electronic, engineering and analytical knowledge to manufacturing processes, transport, power generation and the efficient operation of industrial, commercial and civic buildings. The duration of the program is 12 months full-time or 36 months part-time. The international tuition is £25,000 for full-time study; part-time students pay £25,000 spread over 3 years, at £8,334 per year.
MSc Safety and Reliability Engineering from the University of Aberdeen. The program focuses on safety engineering, reliability engineering, and loss prevention. It develops creativity, leadership and team-building skills and encourages initiative and personal responsibility in the implementation of projects. The tuition for international students is £23,000. The program is also available in an online part-time format.
Computer Science Masters/MSc degree from the University of Birmingham. The program provides students with a solid foundation in both the fundamentals of computer science and practical software development skills with a choice of in-depth optional modules. The duration of the program is one year. The tuition for international students is £24,120.
Certified Reliability Engineer: This certification teaches professionals the principles of performance evaluation and prediction used to improve product and system safety, reliability and maintainability. Candidates must have worked in a full-time, paid role. Paid internships, co-ops or any other coursework cannot be counted toward the work experience requirement. The certification covers the strategic management aspects of reliability engineering. You will also learn its relationship to safety and quality, its impact on warranty programs and customer satisfaction, the consequences of failure, and the potential for liability. You will understand the planning that reliability programs require, and then learn how various engineering and operational systems must be integrated to achieve overall program goals and alignment with organizational goals. The fee for the certification is $398.
Introduction to Site Reliability Engineering and DevOps: The program is designed to teach candidates how DevOps is influencing software delivery and why it is important for IT operations personnel to skill up with DevOps practices. It also teaches how Cloud Computing has enabled organizations to rapidly build and deploy products and expand capacity. The course is free, but a verified certificate costs $199.
The SRE (Site Reliability Engineering) Foundation course is an introduction to the principles and practices that enable an organization to reliably and economically scale critical services. It introduces a range of practices for improving service reliability through a mixture of automation, working methods, and organizational re-alignment. While suitable for anyone interested in Site Reliability Engineering, the course is tailored to those focused on large-scale service availability. The course covers the evolution of Site Reliability Engineering and its future direction. It also gives participants knowledge of the practices, methods, and tools needed to engage people, using real-life scenarios and case stories. The course ends with an independent 90-minute exam consisting of 40 multiple-choice questions. If you score at least 65%, you will receive an SRE Foundation Certificate. The fee for the course is $1,595.
Lastly, if you need further guidance on how to pursue a career in Site Reliability Engineering, or perhaps, more affordable schools to kick-start your career in the field, don’t hesitate to send me an email.