Monitor, observe, and troubleshoot your custom connectors in production using the available monitoring and observability tools.

Overview

The platform provides comprehensive monitoring capabilities through:
  • Grafana - Dashboards, metrics visualization, and alerting
  • Kubernetes - Container and cluster-level monitoring
  • Azure APIM - API analytics and usage metrics

Using monitoring tools

Use the monitoring tools to view logs, metrics, and traces for your connectors.

Grafana

Use Grafana for dashboards and metrics visualization.

Prerequisites

To use Grafana, you need:
  • A Microsoft Entra ID (formerly Azure AD) account
  • Grafana roles assigned in self-service.tfvars:
    team_members = {
      "developer@example.com" = {
        grafana = {
          roles = ["dev-rw", "stg-ro"]
        }
      }
    }
    
Available Grafana roles:
  • {env}-ro: Read-only access to logs and dashboards
  • {env}-rw: Full access including creating dashboards, monitors, and alerts

Opening Grafana

  1. Navigate to: https://ecos{customername}.grafana.net
    • Example: https://ecosecos.grafana.net (where ecos is the customer name)
  2. Sign in with your Microsoft Entra ID credentials
  3. Select your workspace/environment

Grafana data sources

Grafana is pre-configured with data sources named after your customer. Select the appropriate data source based on what you want to view:
  • {customer}-logs: Application and container logs (for example, ecos-logs)
  • {customer}-metrics: Prometheus metrics such as CPU, memory, and request rates (for example, ecos-metrics)
  • {customer}-traces: Distributed traces showing request flows (for example, ecos-traces)
  • {customer}-profiles: Continuous profiling data (for example, ecos-profiles)

Query modes

Grafana offers multiple ways to query your data:
  • Builder mode: Visual query builder with dropdowns and filters - ideal for beginners
  • Code mode: Write raw PromQL (metrics) or LogQL (logs) queries - for advanced users
  • Drilldown view: Click on any metric, log, or trace to explore related data and navigate between signals

Monitoring connector health

Monitor your connector’s health using logs, metrics, and traces.

Application logs

Viewing logs in Grafana

To view application logs in Grafana:
  1. Navigate to Explore in the left sidebar
  2. Select the logs data source: Choose {customer}-logs (for example, ecos-logs)
  3. Choose your query mode:
    • Builder: Use the visual label browser to filter by namespace, app, container
    • Code: Write LogQL queries directly
  4. Use Drilldown view: Click on any log line to see details and related traces
  5. Example LogQL queries:
    # View all logs for a specific connector
    {namespace="gc-caboose", app="my-connector"}
    
    # Filter for errors only
    {namespace="gc-caboose", app="my-connector"} |= "ERROR"
    
    # Search for specific text
    {namespace="gc-caboose"} |~ "(?i)exception|error|failed"
    

Log levels

Monitor different log levels:
  • ERROR: Critical issues requiring immediate attention
  • WARN: Potential issues or unusual conditions
  • INFO: General operational information
  • DEBUG: Detailed diagnostic information

Application metrics

Track key performance and resource metrics for your connector.

Key metrics to monitor

Performance metrics:
  • Request rate: Requests per second
  • Response time: P50, P95, P99 latencies
  • Error rate: Percentage of failed requests
  • Throughput: Messages processed per second
Resource metrics:
  • CPU utilization: Current CPU usage
  • Memory usage: Memory consumption
  • Network I/O: Network traffic
  • Disk I/O: Disk read/write operations
Business metrics:
  • Message processing rate: Messages processed successfully
  • API call success rate: External API call success percentage
  • Queue depth: Number of pending messages
  • Processing time: Time to process each message
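To make these concrete: a connector typically exposes such metrics on its /metrics endpoint in Prometheus text exposition format. The sketch below renders two hypothetical business metrics in that format by hand; real services usually rely on a client library such as prometheus_client instead.

```python
def render_prometheus(metrics: dict[str, tuple[dict[str, str], float]]) -> str:
    """Render metric samples in Prometheus text exposition format."""
    lines = []
    for name, (labels, value) in metrics.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

# Hypothetical samples for a connector named "my-connector".
samples = {
    "messages_processed_total": ({"app": "my-connector"}, 1042.0),
    "queue_depth": ({"app": "my-connector"}, 7.0),
}
print(render_prometheus(samples))
# messages_processed_total{app="my-connector"} 1042.0
# queue_depth{app="my-connector"} 7.0
```

Once scraped by Prometheus, these samples become queryable in Grafana through the {customer}-metrics data source.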

Viewing metrics in Grafana

To view and analyze metrics:
  1. Navigate to Explore in the left sidebar
  2. Select the metrics data source: Choose {customer}-metrics (for example, ecos-metrics)
  3. Choose your query mode:
    • Builder: Use the visual metric browser to select metrics and apply filters
    • Code: Write PromQL queries directly for advanced queries
  4. Use Drilldown view: Click on any metric to explore related logs and traces
  5. Browse Dashboards: Navigate to Dashboards to view pre-built connector dashboards
  6. Example PromQL queries:
    # Request rate per second
    sum(rate(http_requests_total{app="my-connector"}[5m]))
    
    # 95th percentile response time
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="my-connector"}[5m])) by (le))
    
    # Error rate percentage
    sum(rate(http_requests_total{app="my-connector", status=~"5.."}[5m])) / sum(rate(http_requests_total{app="my-connector"}[5m])) * 100
    

Distributed tracing

Trace requests as they flow through your connector and external services.

Viewing traces in Grafana

To view distributed traces:
  1. Navigate to Explore in the left sidebar
  2. Select the traces data source: Choose {customer}-traces (for example, ecos-traces)
  3. Search for traces:
    • Builder: Use filters for service name, operation, duration, or status
    • Code: Write TraceQL queries for advanced filtering
  4. Use Drilldown view: Click on any trace to see the full request flow and related logs
  5. Correlate with logs: From a trace span, click to view associated log entries

Request tracing

Trace requests across services:
  • View request flow: See how requests traverse through services
  • Identify bottlenecks: Find slow operations
  • Error tracking: See where errors occur in the flow
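The flow described above depends on a trace ID being propagated with every operation a request touches. A minimal, illustrative sketch of such propagation using Python's contextvars (this is a toy model, not the platform's actual tracer, which would typically be OpenTelemetry-based):

```python
import contextvars
import uuid

# Current trace ID, carried implicitly across function calls.
trace_id = contextvars.ContextVar("trace_id", default=None)
spans: list[tuple[str, str]] = []  # (trace_id, operation) records

def start_trace() -> str:
    tid = uuid.uuid4().hex
    trace_id.set(tid)
    return tid

def record_span(operation: str) -> None:
    spans.append((trace_id.get(), operation))

def handle_request() -> None:
    start_trace()
    record_span("receive")        # connector receives the request
    call_external_api()
    record_span("respond")        # connector sends the response

def call_external_api() -> None:
    record_span("external-call")  # downstream call shares the same trace ID

handle_request()
# All three spans carry the same trace ID, so a tracing backend can join
# them into a single request flow and pinpoint the slow or failing span.
assert len({tid for tid, _ in spans}) == 1
```

In a real deployment the trace ID also travels across process boundaries in request headers, which is what lets Grafana stitch spans from different services into one trace.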

Health checks

Verify your connector is running and ready to handle requests.

Kubernetes health checks

Monitor connector health at the Kubernetes level:
# Check pod status
kubectl get pods -n <namespace> -l app=my-custom-connector

# View pod events
kubectl describe pod <pod-name> -n <namespace>

# Check readiness
kubectl get endpoints -n <namespace>

Application health endpoints

Monitor app-level health:
  • Live endpoint at /health/live: Indicates whether the app process is running. Kubernetes uses this to decide whether to restart the container.
  • Ready endpoint at /health/ready: Indicates whether the app can accept traffic. Kubernetes uses this to decide whether to route requests to the pod.
  • Metrics endpoint at /metrics: Exposes app metrics in Prometheus format for monitoring and alerting.
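A minimal sketch of these three endpoints using only Python's standard library (illustrative: real connectors usually wire the same routes into their web framework, and the metric value shown is hypothetical):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

ready = threading.Event()  # set once startup work (connections, caches) is done

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/live":
            self._reply(200, "alive")      # process is up; no restart needed
        elif self.path == "/health/ready":
            if ready.is_set():
                self._reply(200, "ready")  # safe to route traffic to this pod
            else:
                self._reply(503, "starting")
        elif self.path == "/metrics":
            # Hypothetical metric in Prometheus text format.
            self._reply(200, 'http_requests_total{app="my-connector"} 42')
        else:
            self._reply(404, "not found")

    def _reply(self, status: int, body: str) -> None:
        data = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # silence per-request logging in this demo
        pass

server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ready.set()  # mark startup complete
```

Note the split: liveness stays green as long as the process runs, while readiness flips to 503 whenever the connector temporarily cannot serve traffic, so Kubernetes stops routing to it without restarting it.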

Error tracking

Track and analyze errors to identify and resolve issues.

Error monitoring

Common error types to monitor:
  • Connection errors: External API connectivity issues
  • Timeout errors: Slow or unresponsive services
  • Authentication errors: Invalid credentials or tokens
  • Validation errors: Invalid data formats
  • Resource errors: Memory or CPU exhaustion

Alerting

Configure alerts in Grafana to receive notifications when issues occur.

Setting up alerts in Grafana

To create an alert rule in Grafana:
  1. Navigate to Alerting > Alert rules in the left sidebar
  2. Click “New alert rule”
  3. Define the query:
    • Select your data source (Prometheus, Loki, etc.)
    • Write the query for the metric you want to monitor
    • Set the threshold condition (for example, > 0.05 for 5% error rate)
  4. Set evaluation behavior:
    • Choose or create a folder for the rule
    • Set the evaluation interval (for example, every 1 minute)
    • Set the pending period before firing (for example, 5 minutes)
  5. Add labels and annotations:
    • Add labels to categorize alerts
    • Add annotations for alert details and runbook links
  6. Configure notifications:
    • Link to a contact point (email, Slack, PagerDuty, etc.)
    • Set notification policy if needed
  7. Save the alert rule

Example alert conditions

  • Error rate > 5% for 5 minutes: Critical
  • Response time (P95) > 1000 ms for 10 minutes: Warning
  • CPU usage > 80% for 5 minutes: Warning
  • Memory usage > 90% for 5 minutes: Critical
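These example conditions can be read as a simple threshold-to-severity mapping. The sketch below expresses that mapping in code (illustrative only; in practice Grafana evaluates the thresholds server-side, and the metric names here are hypothetical):

```python
def evaluate_alerts(metrics: dict[str, float]) -> list[tuple[str, str]]:
    """Map current metric values to (alert, severity) pairs per the table above.

    Values are assumed to have already breached their threshold for the
    required pending period (5 or 10 minutes) before this is called.
    """
    rules = [
        ("error_rate",        lambda v: v > 0.05, "Critical"),  # > 5%
        ("response_time_p95", lambda v: v > 1.0,  "Warning"),   # > 1000 ms
        ("cpu_usage",         lambda v: v > 0.80, "Warning"),   # > 80%
        ("memory_usage",      lambda v: v > 0.90, "Critical"),  # > 90%
    ]
    return [(name, severity) for name, breached, severity in rules
            if name in metrics and breached(metrics[name])]

firing = evaluate_alerts({"error_rate": 0.08, "cpu_usage": 0.65})
# Only the error-rate rule fires: [("error_rate", "Critical")]
```

The pending period matters: requiring a condition to hold for several minutes before firing filters out short spikes that would otherwise page on-call for nothing.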

Example PromQL queries for alerts

# Error rate alert
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

# High latency alert
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1

# High CPU usage
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.8

Dashboards

View pre-built dashboards or create custom dashboards to visualize key metrics and monitor connector health.

Dashboard folders

Grafana dashboards are organized into two main folders:
  • ecos: Platform infrastructure and components (for example, Kubernetes cluster metrics, ArgoCD deployments, Knative services, and Istio service mesh dashboards)
  • custom: Application-level dashboards for your connectors (for example, request counts, response times at P50, P90, P95, and P99, error rates, and throughput)
To browse dashboards:
  1. Navigate to Dashboards in the left sidebar
  2. Select a folder (ecos or custom)
  3. Click on a dashboard to view it

Custom folder dashboards

The custom folder contains application dashboards that show:
  • Request metrics: Total requests, requests per second
  • Latency percentiles: P50, P90, P95, P99 response times
  • Error rates: 4xx and 5xx error percentages
  • Throughput: Messages processed per second

Creating custom dashboards

Recommended dashboard panels:
  • Request rate over time
  • Error rate percentage
  • Response time percentiles
  • CPU and memory usage
  • Active connections
  • Queue depth
  • Recent errors log

Creating dashboards in Grafana

  1. Navigate to Dashboards > New > New Dashboard
  2. Add a visualization and select your panel type:
    • Time series: For metrics over time
    • Stat: For single value displays
    • Logs: For log panels
    • Table: For tabular data
  3. Configure the query:
    • Select the data source (Prometheus for metrics, Loki for logs)
    • Write your PromQL or LogQL query
  4. Customize the panel:
    • Set title and description
    • Configure thresholds and colors
    • Adjust axes and legends
  5. Save the dashboard with a descriptive name

Performance analysis

Analyze performance data to identify and resolve bottlenecks.

Identifying performance issues

Slow response times:
  1. Check external API latency: External services may be slow
  2. Review database queries: Optimize slow queries
  3. Analyze thread pool usage: May need more threads
  4. Check resource constraints: CPU or memory limits
High error rates:
  1. Review error logs: Identify error patterns
  2. Check external service health: External APIs may be down
  3. Validate input data: Ensure data format is correct
  4. Review authentication: Check token expiration
Resource exhaustion:
  1. Monitor memory usage: Check for memory leaks
  2. Review CPU usage: Optimize CPU-intensive operations
  3. Check connection pools: May need to increase pool size
  4. Review thread usage: Optimize thread pool configuration

Log archives

View archived logs for long-term analysis and compliance.

Viewing log archives

For long-term log storage and analysis, you need the following prerequisites:
  • Azure role: Log archive access role assigned
  • Storage account access: Azure Storage Account credentials
  • Key Vault access: For SPN credentials (automation)
Viewing methods:
  • Azure Portal: Browse storage account
  • Azure Storage Explorer: Desktop application
  • Azure CLI: Command-line interface
  • AzCopy: High-performance copying
Use cases:
  • Compliance: Long-term log retention
  • Forensics: Investigating historical issues
  • Analytics: Analyzing trends over time
  • Auditing: Security and compliance audits

API management monitoring

Monitor API usage and performance through Azure API Management.

APIM analytics

Azure APIM provides the following metrics:
  • Request count: Total API requests
  • Response time: API response times
  • Error rate: Failed API calls
  • Bandwidth: Data transfer volume
  • Subscription usage: Per-subscription metrics
View APIM analytics:
  1. Navigate to Azure Portal
  2. Select APIM instance for your environment
  3. View Analytics: See usage and performance metrics
  4. Export reports: Generate usage reports

Best practices

  1. Set up alerts early: Configure alerts during development
  2. Monitor key metrics: Focus on business-critical metrics
  3. Create dashboards: Visualize important metrics
  4. Review logs regularly: Check logs for issues
  5. Track error trends: Monitor error rates over time
  6. Set up SLOs: Define service level objectives
  7. Document runbooks: Create troubleshooting guides
  8. Regular reviews: Review metrics and alerts periodically

Troubleshooting guide

Use these steps to diagnose and resolve common issues.

High error rate

  1. Check logs: Review error messages in Grafana
  2. Identify pattern: Look for common error types
  3. Check external services: Verify external API health
  4. Review recent changes: Check recent deployments
  5. Scale resources: Increase CPU, memory, or replica count if the connector is saturated

Slow performance

  1. Check response times: Identify slow operations
  2. Review resource usage: Check CPU and memory
  3. Analyze traces: Find bottlenecks in request flow
  4. Check external dependencies: External services may be slow
  5. Optimize code: Review and optimize slow operations

Memory issues

  1. Check memory usage: Monitor memory consumption
  2. Review heap dumps: Analyze memory usage patterns
  3. Check for leaks: Identify memory leaks
  4. Increase limits: Adjust memory limits if needed
  5. Optimize code: Reduce memory footprint

Connection issues

  1. Check network connectivity: Verify network connections
  2. Review connection pool: Check pool configuration
  3. Monitor connection errors: Track connection failures
  4. Check firewall rules: Verify firewall configurations
  5. Review DNS: Check DNS resolution

Getting help

  • Grafana support: Check Grafana documentation for usage guides
  • Team support: Contact your team lead for access issues
  • Platform support: Create a support ticket for platform issues