Based on our records, OpenTelemetry appears to be more popular than Google Site Reliability Engineering. It has been mentioned 156 times since March 2021. We track product recommendations and mentions on various public social media platforms and blogs. They can help you identify which product is more popular and what people think of it.
In my view it is having a dedicated team focusing their full mental bandwidth on pro-actively understanding and managing robustness of the system. In Pure DevOps, it seems to me developers often don't have the full picture of the system, and not enough bandwidth to foresee complex interactions from their changes. These are from my experiences spending one year as a developer in somewhat large a greenfield... - Source: Hacker News / 6 months ago
Site Reliability Engineering, introduced by Google, extends the principles of software engineering to operations. Unlike DevOps, SRE places a stronger emphasis on reliability, availability, and scalability. SRE teams are tasked with maintaining the health and performance of systems by applying engineering practices to operations. The ultimate objective is to achieve a balance between service reliability and... Source: 8 months ago
Define SLOs for availability and latency. Google's SRE book is good reading for this. Source: 11 months ago
Have you gone through the SRE Books? Source: 11 months ago
Google SRE books is always a good read. Source: 12 months ago
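The advice in the mentions above to define SLOs for availability and latency can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the SRE books themselves: the names `SloReport`, `evaluate_slo`, and `latency_good_count`, along with the sample request counts, the 300 ms threshold, and the targets, are all hypothetical and chosen only for illustration. It treats an SLI as the fraction of "good" requests and the error budget as the number of bad requests the target still allows.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    budget_total: float      # bad requests the target allows in this window
    budget_remaining: float  # allowed bad requests not yet spent (may be negative)
    met: bool                # True when the measured SLI reaches the target


def evaluate_slo(good: int, total: int, target: float) -> SloReport:
    """Compare a good/total request count against an SLO target such as 0.999."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    sli = good / total
    budget_total = (1.0 - target) * total
    bad = total - good
    return SloReport(sli, budget_total, budget_total - bad, sli >= target)


def latency_good_count(latencies_ms: Iterable[float], threshold_ms: float) -> int:
    """Count requests that finished within the latency threshold."""
    return sum(1 for latency in latencies_ms if latency <= threshold_ms)


# Availability SLO: hypothetical window of 1,000,000 requests, 500 of them failed.
availability = evaluate_slo(good=999_500, total=1_000_000, target=0.999)

# Latency SLO: hypothetical sample, "good" means served in 300 ms or less.
sample_ms = [120, 250, 310, 90, 480, 200, 150, 305, 220, 100]
latency = evaluate_slo(
    good=latency_good_count(sample_ms, threshold_ms=300),
    total=len(sample_ms),
    target=0.9,
)

print(availability)
print(latency)
```

Expressing both SLOs as good-over-total ratios keeps the error budget calculation identical for availability and latency; only the definition of a "good" request changes.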
Our first approach was to implement a separate SDK for each independent technology stack. We decided to use OpenTelemetry which is widely adopted and covers most of our needs. - Source: dev.to / about 1 month ago
Distributed system administrators need mechanisms and tools for monitoring individual nodes in order to analyze the system and promptly detect anomalies. Developers also need effective mechanisms for analyzing, diagnosing issues, and identifying bugs in protocol implementations. Logging, tracing, and collecting metrics are common observability techniques to allow monitoring and obtaining diagnostic information... - Source: dev.to / 29 days ago
When choosing distributed tracing tools, considerations include your technology stack, business requirements, and monitoring complexity. Zipkin, SkyWalking, and OpenTelemetry are popular distributed tracing solutions, each with its unique features. - Source: dev.to / about 2 months ago
You can follow this process with any large token AI system like Claude by identifying tracing data relevant to the code you are working on, using it as context to prompt OpenAI or other LLMs. Generally, you’d generate tracing data by implementing OpenTelemetry (aka OTEL) libraries into your application, adding spans to your functions with Jaeger, or using commercial SaaS tools like Honeycomb and Datadog. - Source: dev.to / about 2 months ago
OneUptime (https://github.com/oneuptime/oneuptime) is the open-source alternative to DataDog. It's 100% free and you can self-host it on your VM / server / cloud or you can use SaaS at https://oneuptime.com NEW UPDATES (since we last posted to HN): We now support OpenTelemetry (https://opentelemetry.io/) natively which will help you to monitor, observe and debug any app, service, database or stack. - Source: Hacker News / 2 months ago
Ganeti - Ganeti is a cluster management tool built on top of existing virtualization technologies.
SigNoz - Open source alternative to Datadog
Apache Helix - A cluster management framework for partitioned and replicated distributed resources
Prometheus - An open-source systems monitoring and alerting toolkit.
Linux Foundation Training - A web-based platform offering advanced, up-to-date courses and tutorials for learning development and programming skills.
Grafana - Data visualization & Monitoring with support for Graphite, InfluxDB, Prometheus, Elasticsearch and many more databases