Overview of Prometheus metrics with explanations, queries, use cases, and error detection details
router_http_requests_total
: Observes overall HTTP traffic trends and rates. Applying the rate function shows changes in traffic.
router_http_requests_in_flight
: Tracks concurrent in-flight requests to avoid overloads.
router_http_request_duration_milliseconds
: Measures average request duration to detect latency issues.
router_graphql_operation_planning_time
: Tracks GraphQL query planning time to identify performance inefficiencies.
router_info
: Info metric that provides metadata about every running router configuration.
router_graphql_cache_cost_max
: Measures the maximum cost of cached operations to optimize cache usage.
router_graphql_cache_cost_stats_total
: Tracks total cost statistics of cached operations to ensure balanced resource allocation.
router_graphql_cache_keys_stats_total
: Counts the total number of unique cache keys to monitor cache utilization.
router_graphql_cache_request_stats_total
: Monitors the total number of requests served from the cache to assess cache hit rates.
router_http_client_connection_max
: A static configuration value giving the maximum number of connections allowed per host, with a subgraph dimension.
router_http_client_connection_active
: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While it’s less common, multiple subgraphs can share the same host, which is why both dimensions are included.
router_http_client_connection_acquire_duration
: The time in milliseconds a connection took to be initialized, including DNS resolution, the TLS handshake, and dialing the host.
go_memstats_sys_bytes
: Monitors memory obtained from the system across all instances.
go_memstats_heap_alloc_bytes
: Number of heap bytes allocated and still in use across all instances. Focuses on heap memory usage for efficient memory management. The value is the same as go_memstats_alloc_bytes.
go_gc_duration_seconds
: Tracks garbage collection duration to identify performance bottlenecks.
go_goroutines
: Monitors active goroutines to prevent resource exhaustion.
router_graphql_operation_planning_time
The router_graphql_operation_planning_time metric is a histogram. It is exposed as three different series, namely:
_count:
This represents the total number of observed events.
_sum:
This is the sum of all observed values, providing the cumulative planning time.
_bucket:
This counts the observations that fall at or below each bucket's upper bound (the le label), useful for calculating percentiles and distribution.
Analyze upper boundary of latency
The router_graphql_operation_planning_time_bucket metric is used to compute quantiles for operation planning time, providing deeper insight into the distribution of planning times than averages alone.
Example PromQL Query for Quantile:
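A minimal sketch of such a quantile query, assuming the 95th percentile and a 5-minute rate window (both values are illustrative choices, not taken from the router's configuration):

```promql
# 95th percentile of GraphQL operation planning time over the last 5 minutes.
# The 0.95 quantile and the [5m] window are assumptions; adjust both as needed.
histogram_quantile(
  0.95,
  sum by (le) (rate(router_graphql_operation_planning_time_bucket[5m]))
)
```

Aggregating by le keeps the bucket boundaries that histogram_quantile needs while combining all other label dimensions.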
router_http_request_duration_milliseconds_{sum,count}
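The _sum and _count series can be combined into an average request duration. A sketch, assuming a 5-minute window:

```promql
# Average request duration in milliseconds over the last 5 minutes.
# The [5m] window is an assumption; pick one that spans several scrape intervals.
  sum(rate(router_http_request_duration_milliseconds_sum[5m]))
/
  sum(rate(router_http_request_duration_milliseconds_count[5m]))
```

The result is empty or NaN for periods in which no requests were observed.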
router_http_requests_in_flight
router_http_requests_total
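Because this metric is a counter, traffic changes are usually read through the rate function. A sketch, assuming a 5-minute window and aggregation across all router instances:

```promql
# Requests per second across all router instances, averaged over 5 minutes.
# The [5m] window is an assumption.
sum(rate(router_http_requests_total[5m]))
```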
router_graphql_cache_cost_max
router_graphql_cache_cost_stats_total
router_graphql_cache_keys_stats_total
router_graphql_cache_request_stats_total
router_info
Counting samples of the router_info metric over a time window shows how many were detected. For the last N seconds (20 in this case), we can assume that two router_info samples were detected for the base configuration.
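One way to express such a count, assuming a scrape interval of roughly 10 seconds so that a 20-second window holds two samples per series (the scrape interval is an assumption, not stated above):

```promql
# Number of router_info samples seen per series in the last 20 seconds.
# With an assumed ~10s scrape interval, a healthy router yields 2.
count_over_time(router_info[20s])
```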
Reason for Monitoring:
While the router is running, the router_info metric will be available. This can be used to detect when the router is down.
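A sketch of a down-detection expression built on this idea. It uses the standard absent function and could serve as an alert condition:

```promql
# Returns 1 when no router_info series exists at all, and nothing otherwise.
# Add label matchers (hypothetical here) to scope the check to one router.
absent(router_info)
```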
go_memstats_alloc_bytes
go_memstats_sys_bytes
go_memstats_alloc_bytes_total
go_memstats_heap_objects
go_memstats_heap_alloc_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_idle_bytes
go_memstats_next_gc_bytes
go_memstats_mallocs_total
Counts the total number of malloc calls.
Example PromQL Query:
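A minimal sketch, assuming a 5-minute window:

```promql
# Allocations per second for each instance, averaged over 5 minutes.
# The [5m] window is an assumption.
rate(go_memstats_mallocs_total[5m])
```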
The rate of this counter gives the number of malloc calls per second.
go_gc_duration_seconds
The go_gc_duration_seconds metric is a summary that captures the distribution of garbage collection pause times. It helps to analyze how frequently pauses of different durations occur.
Example PromQL Query:
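A sketch that derives the average pause time from the summary's _sum and _count series, assuming a 5-minute window:

```promql
# Average garbage collection pause in seconds over the last 5 minutes.
# Yields NaN for instances with no GC cycle in the window.
  rate(go_gc_duration_seconds_sum[5m])
/
  rate(go_gc_duration_seconds_count[5m])
```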
As a summary, the go_gc_duration_seconds metric provides the maximum observed garbage collection duration at quantile 1. This is useful for identifying the worst-case scenario, allowing you to address outliers or exceptionally long garbage collection pauses that could impact the application’s responsiveness and performance.
Adjust the quantile to your needs.
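A sketch of selecting a specific quantile from the summary. The Go collector exposes a fixed set of quantile label values (0, 0.25, 0.5, 0.75, and 1), so only those can be selected:

```promql
# Longest observed garbage collection pause, in seconds, per instance.
# Replace "1" with "0.75" or "0.5" for less extreme quantiles.
go_gc_duration_seconds{quantile="1"}
```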
go_goroutines
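To watch for goroutine growth that might point to resource exhaustion, the gauge can be read per instance. A sketch, assuming the standard instance target label:

```promql
# Current number of goroutines per scraped instance.
# Grouping by the instance label is an assumption about the scrape setup.
sum by (instance) (go_goroutines)
```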