In my view it is having a dedicated team focusing their full mental bandwidth on proactively understanding and managing the robustness of the system. In pure DevOps, it seems to me developers often don't have the full picture of the system, nor enough bandwidth to foresee complex interactions from their changes. These are from my experiences spending one year as a developer in a somewhat large greenfield... - Source: Hacker News / 6 months ago
Site Reliability Engineering, introduced by Google, extends the principles of software engineering to operations. Unlike DevOps, SRE places a stronger emphasis on reliability, availability, and scalability. SRE teams are tasked with maintaining the health and performance of systems by applying engineering practices to operations. The ultimate objective is to achieve a balance between service reliability and... Source: 8 months ago
Define SLOs for availability and latency. Google's SRE book is good reading for this. Source: 11 months ago
Have you gone through the SRE Books? Source: 11 months ago
Google's SRE books are always a good read. Source: 12 months ago
The inflection point for me was when I read a book on Site Reliability Engineering someone left on my desk (IDK why); I hated toil and wanted to design systems that just ran. When I finished the book, I knew this was the job that I wanted for my career. I wanted a career that was fulfilling, engaging, and high-paying, so this fit the bill (I'll talk about comp in the next post). I started to upskill in that... Source: 12 months ago
Read these books: https://sre.google/books/. Source: 12 months ago
Reading Google's SRE books helped me the most during my internship. Source: about 1 year ago
That brings me to the last point: you're not doing the right thing when you want to remind people to look at the dashboards just for the sake of looking at dashboards. There needs to be a reason for that. You should define SLOs that specify the error rate and response times your service must meet. And then you must take them seriously in your process. If you are tracking worse than the SLO, you must... Source: about 1 year ago
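As a rough illustration of the comment above, here is a minimal Python sketch of checking a window of requests against an error-rate SLO and a response-time SLO. Everything in it is an assumption made for illustration: the `evaluate_slo` function, the `SloReport` fields, and the default targets (99.9% availability, 95% of requests within 300 ms) are hypothetical and do not come from the comment or from Google's SRE books.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float        # fraction of requests that succeeded
    latency_compliance: float  # fraction of requests at or under the latency threshold
    error_budget_used: float   # share of the allowed failures consumed (1.0 = exhausted)
    meets_slo: bool            # True only if both targets are met


def evaluate_slo(requests, availability_target=0.999,
                 latency_threshold_ms=300.0, latency_target=0.95):
    """Evaluate a window of requests against hypothetical SLO targets.

    `requests` is a list of (succeeded, latency_ms) tuples.
    """
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the evaluation window")

    good = sum(1 for ok, _ in requests if ok)
    fast = sum(1 for _, ms in requests if ms <= latency_threshold_ms)

    availability = good / total
    latency_compliance = fast / total

    # The error budget is the number of failures the availability target tolerates.
    allowed_failures = (1.0 - availability_target) * total
    failures = total - good
    if allowed_failures > 0:
        budget_used = failures / allowed_failures
    else:
        # A 100% target leaves no budget: any failure exhausts it.
        budget_used = float("inf") if failures else 0.0

    meets = (availability >= availability_target
             and latency_compliance >= latency_target)
    return SloReport(availability, latency_compliance, budget_used, meets)


if __name__ == "__main__":
    # Synthetic window: 998 fast successes, 2 slow failures.
    window = [(True, 120.0)] * 998 + [(False, 900.0)] * 2
    report = evaluate_slo(window, availability_target=0.99)
    print(report)
```

A report with `meets_slo` set to `False`, or with `error_budget_used` approaching 1.0, is the kind of signal the comment suggests acting on, as opposed to checking dashboards without a reason.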
It's important to be well rounded. If operations problems are what you're into, then books like Google's series on SRE are a good place to look. Become knowledgeable about cloud computing and building distributed systems in general. Kleppmann's Designing Data-Intensive Applications is a good one for learning to design systems. Source: about 1 year ago
Perhaps you don't work on a large enough Clojure codebase where this is an issue, but the common symptom on large codebases is that you cannot understand a piece of Clojure code in isolation; you must have the entire module, or sometimes even the entire system, in your mental context in order to understand the shape of the data some function you care about will receive and what properties it will have. Hmm, that... Source: about 1 year ago
Google published a few free SRE books https://sre.google/books/. Source: about 1 year ago
First step is redundancy: having backups, failover, overprovisioning. Essentially prepared "plan Bs". Next step is introspection: aggregate monitoring and enough detail to figure out if there are issues. Next step is being notified when things break, i.e. anomaly detection and alerting. Then, debuggability: enough detail to solve issues. Disaster recovery testing is part of ensuring you actually have this, and not... - Source: Hacker News / about 1 year ago
https://dl.acm.org/doi/fullHtml/10.1145/2854146 https://sre.google/books/ https://cloud.google.com/blog/topics/developers-practitioners/how-google-got-to-rolling-linux-releases-for-desktops?hl=en https://en.m.wikipedia.org/wiki/Borg_(cluster_manager) https://research.google/pubs/pub43438/. Source: over 1 year ago
Since you're asking for mentorship, I'm assuming you've already read the books on https://sre.google/books/ and related ones like https://www.amazon.com/Real-World-SRE-Survival-Responding-Maximizing-ebook/dp/B07BJKZQ7Y ? What skills did you think you needed more help with, or which concepts were fuzzy? Source: over 1 year ago
Steve Azzopardi started curating a repository called awesome-slo, which is super useful as it aggregates a lot of great related content about the topic, and is a good complement to the Google SRE books. I’m bookmarking it. - Source: dev.to / over 1 year ago
SRE is a role created by Google, as detailed in their books. It largely has a similar goal to the DevOps movement, but with more rigidly defined practices and documentation about how Google does it. Source: over 1 year ago
It isn't a measure of software quality. It's an evaluation of how well the software design meets a list of best practices. There are a few best practice guides around (e.g. the 12-factor app, Building Secure & Reliable Systems). But really, it's a measure of how confident you'd be of getting a good night's sleep while carrying the on-call pager for the service that runs the code in production. Source: over 1 year ago
Hey, I would highly recommend looking into Site Reliability Engineering (SRE). A lot of it involves DevOps but SRE is ultimately about applying principles of software engineering to infrastructure problems. If that sounds interesting I would read Google’s book, The Site Reliability Workbook. It is online and a free read. Check it out! https://sre.google/books/. Source: over 1 year ago
Free to read online https://sre.google/books/. Source: over 1 year ago
Google has published three SRE books that are available for free online: https://sre.google/books/. Source: over 1 year ago
This is an informative page about Google Site Reliability Engineering. You can review and discuss the product here. The primary details have not been verified within the last quarter, and they might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated, as they help everyone in the community make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.