Overview
The platform provides comprehensive monitoring capabilities through:
- Grafana - Dashboards, metrics visualization, and alerting
- Kubernetes - Container and cluster-level monitoring
- Azure APIM - API analytics and usage metrics
Using monitoring tools
Use the monitoring tools to view logs, metrics, and traces for your connectors.

Grafana
Use Grafana for dashboards and metrics visualization.

Prerequisites
To use Grafana, you need:
- A Microsoft Entra ID account (formerly Azure AD)
- Grafana roles assigned in self-service.tfvars:
  - {env}-ro: Read-only access to logs and dashboards
  - {env}-rw: Full access, including creating dashboards, monitors, and alerts
Opening Grafana
- Navigate to https://ecos{customername}.grafana.net
  - Example: https://ecosecos.grafana.net (where ecos is the customer name)
- Sign in with your Microsoft Entra ID credentials
- Select your workspace/environment
Grafana data sources
Grafana is pre-configured with data sources named after your customer. Select the appropriate data source based on what you want to view:

| Data source | Purpose | Example (for customer “ecos”) |
|---|---|---|
| {customer}-logs | Application and container logs | ecos-logs |
| {customer}-metrics | Prometheus metrics (CPU, memory, request rates) | ecos-metrics |
| {customer}-traces | Distributed traces (request flows) | ecos-traces |
| {customer}-profiles | Continuous profiling data | ecos-profiles |
Query modes
Grafana offers multiple ways to query your data:
- Builder mode: Visual query builder with dropdowns and filters - ideal for beginners
- Code mode: Write raw PromQL (metrics) or LogQL (logs) queries - for advanced users
- Drilldown view: Click on any metric, log, or trace to explore related data and navigate between signals
Monitoring connector health
Monitor your connector’s health using logs, metrics, and traces.

Application logs
Viewing logs in Grafana
To view application logs in Grafana:
- Navigate to Explore in the left sidebar
- Select the logs data source: Choose {customer}-logs (for example, ecos-logs)
- Choose your query mode:
  - Builder: Use the visual label browser to filter by namespace, app, or container
  - Code: Write LogQL queries directly
- Use Drilldown view: Click on any log line to see details and related traces
Example LogQL queries:
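The following are illustrative LogQL queries for Code mode; the namespace and app label values shown are placeholders, so substitute your own connector's labels:

```logql
# All logs for a connector in a given namespace
{namespace="my-namespace", app="my-connector"}

# Only log lines containing ERROR
{namespace="my-namespace", app="my-connector"} |= "ERROR"

# Per-second rate of ERROR lines over the last 5 minutes
sum(rate({namespace="my-namespace", app="my-connector"} |= "ERROR" [5m]))
```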
Log levels
Monitor different log levels:
- ERROR: Critical issues requiring immediate attention
- WARN: Potential issues or unusual conditions
- INFO: General operational information
- DEBUG: Detailed diagnostic information
Application metrics
Track key performance and resource metrics for your connector.

Key metrics to monitor
Performance metrics:
- Request rate: Requests per second
- Response time: P50, P95, P99 latencies
- Error rate: Percentage of failed requests
- Throughput: Messages processed per second

Resource metrics:
- CPU utilization: Current CPU usage
- Memory usage: Memory consumption
- Network I/O: Network traffic
- Disk I/O: Disk read/write operations

Business metrics:
- Message processing rate: Messages processed successfully
- API call success rate: External API call success percentage
- Queue depth: Number of pending messages
- Processing time: Time to process each message
Viewing metrics in Grafana
To view and analyze metrics:
- Navigate to Explore in the left sidebar
- Select the metrics data source: Choose {customer}-metrics (for example, ecos-metrics)
- Choose your query mode:
  - Builder: Use the visual metric browser to select metrics and apply filters
  - Code: Write PromQL queries directly for advanced queries
- Use Drilldown view: Click on any metric to explore related logs and traces
- Browse Dashboards: Navigate to Dashboards to view pre-built connector dashboards
Example PromQL queries:
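A few illustrative PromQL queries for Code mode; the metric names follow common Prometheus and cAdvisor conventions and may differ for your connector:

```promql
# Request rate (requests per second over the last 5 minutes)
sum(rate(http_requests_total{namespace="my-namespace"}[5m]))

# P95 response time in seconds
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace="my-namespace"}[5m])) by (le))

# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])) by (pod)

# Memory usage per pod
sum(container_memory_working_set_bytes{namespace="my-namespace"}) by (pod)
```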
Distributed tracing
Trace requests as they flow through your connector and external services.

Viewing traces in Grafana
To view distributed traces:
- Navigate to Explore in the left sidebar
- Select the traces data source: Choose {customer}-traces (for example, ecos-traces)
- Search for traces:
  - Builder: Use filters for service name, operation, duration, or status
  - Code: Write TraceQL queries for advanced filtering
- Use Drilldown view: Click on any trace to see the full request flow and related logs
- Correlate with logs: From a trace span, click to view associated log entries
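For Code mode, a couple of illustrative TraceQL queries; the service name is a placeholder for your connector's configured service name:

```traceql
# Traces from a service that took longer than 500 ms
{ resource.service.name = "my-connector" && duration > 500ms }

# Traces from that service containing an error status
{ resource.service.name = "my-connector" && status = error }
```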
Request tracing
Trace requests across services:
- View request flow: See how requests traverse through services
- Identify bottlenecks: Find slow operations
- Error tracking: See where errors occur in the flow
Health checks
Verify your connector is running and ready to handle requests.

Kubernetes health checks
Monitor connector health at the Kubernetes level:

Application health endpoints
Monitor app-level health:
- Live endpoint at /health/live: Indicates whether the app process is running. Kubernetes uses this to decide whether to restart the container.
- Ready endpoint at /health/ready: Indicates whether the app can accept traffic. Kubernetes uses this to decide whether to route requests to the pod.
- Metrics endpoint at /metrics: Exposes app metrics in Prometheus format for monitoring and alerting.
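The live and ready endpoints above map onto Kubernetes liveness and readiness probes. A minimal sketch of the container spec fragment, where the port and timing values are illustrative assumptions:

```yaml
# Illustrative probe configuration for a connector container
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
```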
Error tracking
Track and analyze errors to identify and resolve issues.

Error monitoring
Common error types:
- Connection errors: External API connectivity issues
- Timeout errors: Slow or unresponsive services
- Authentication errors: Invalid credentials or tokens
- Validation errors: Invalid data formats
- Resource errors: Memory or CPU exhaustion
Alerting
Configure alerts in Grafana to receive notifications when issues occur.

Setting up alerts in Grafana
To create an alert rule in Grafana:
- Navigate to Alerting > Alert rules in the left sidebar
- Click “New alert rule” to create a new rule
- Define the query:
- Select your data source (Prometheus, Loki, etc.)
- Write the query for the metric you want to monitor
- Set the threshold condition (for example, > 0.05 for a 5% error rate)
- Set evaluation behavior:
- Choose or create a folder for the rule
- Set the evaluation interval (for example, every 1 minute)
- Set the pending period before firing (for example, 5 minutes)
- Add labels and annotations:
- Add labels to categorize alerts
- Add annotations for alert details and runbook links
- Configure notifications:
- Link to a contact point (email, Slack, PagerDuty, etc.)
- Set notification policy if needed
- Save the alert rule
Example alert conditions
| Metric | Condition | Duration | Severity |
|---|---|---|---|
| Error rate | > 5% | 5 minutes | Critical |
| Response time P95 | > 1000ms | 10 minutes | Warning |
| CPU usage | > 80% | 5 minutes | Warning |
| Memory usage | > 90% | 5 minutes | Critical |
Example PromQL queries for alerts
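As a sketch of the conditions in the table above, assuming conventional Prometheus and cAdvisor metric names (placeholders; adjust to the metrics your connector actually exposes):

```promql
# Error rate above 5% (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.05

# P95 response time above 1000 ms
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1

# Memory usage above 90% of the container limit
container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
```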
Dashboards
View pre-built dashboards or create custom dashboards to visualize key metrics and monitor connector health.

Dashboard folders
Grafana dashboards are organized into two main folders:

| Folder | Purpose | Example dashboards |
|---|---|---|
| ecos | Platform infrastructure and components | Kubernetes cluster metrics, ArgoCD deployments, Knative services, Istio service mesh |
| custom | Application-level dashboards for your connectors | Request counts, response times (P50, P90, P95, P99), error rates, throughput |
To view a dashboard:
- Navigate to Dashboards in the left sidebar
- Select a folder (ecos or custom)
- Click on a dashboard to view it
Custom folder dashboards
The custom folder contains application dashboards that show:
- Request metrics: Total requests, requests per second
- Latency percentiles: P50, P90, P95, P99 response times
- Error rates: 4xx and 5xx error percentages
- Throughput: Messages processed per second
Creating custom dashboards
Recommended dashboard panels:
- Request rate over time
- Error rate percentage
- Response time percentiles
- CPU and memory usage
- Active connections
- Queue depth
- Recent errors log
Creating dashboards in Grafana
- Navigate to Dashboards > New > New Dashboard
- Add a visualization and select your panel type:
- Time series: For metrics over time
- Stat: For single value displays
- Logs: For log panels
- Table: For tabular data
- Configure the query:
- Select the data source (Prometheus for metrics, Loki for logs)
- Write your PromQL or LogQL query
- Customize the panel:
- Set title and description
- Configure thresholds and colors
- Adjust axes and legends
- Save the dashboard with a descriptive name
Performance analysis
Analyze performance data to identify and resolve bottlenecks.

Identifying performance issues
Slow response times:
- Check external API latency: External services may be slow
- Review database queries: Optimize slow queries
- Analyze thread pool usage: May need more threads
- Check resource constraints: CPU or memory limits

High error rates:
- Review error logs: Identify error patterns
- Check external service health: External APIs may be down
- Validate input data: Ensure data format is correct
- Review authentication: Check token expiration

Resource exhaustion:
- Monitor memory usage: Check for memory leaks
- Review CPU usage: Optimize CPU-intensive operations
- Check connection pools: May need to increase pool size
- Review thread usage: Optimize thread pool configuration
Log archives
View archived logs for long-term analysis and compliance.

Viewing log archives
For long-term log storage and analysis:

Prerequisites:
- Azure role: Log archive access role assigned
- Storage account access: Azure Storage Account credentials
- Key Vault access: For SPN credentials (automation)

Access methods:
- Azure Portal: Browse storage account
- Azure Storage Explorer: Desktop application
- Azure CLI: Command-line interface
- AzCopy: High-performance copying

Use cases:
- Compliance: Long-term log retention
- Forensics: Investigating historical issues
- Analytics: Analyzing trends over time
- Auditing: Security and compliance audits
API management monitoring
Monitor API usage and performance through Azure API Management.

APIM analytics
Monitor API usage through Azure APIM.

Available metrics:
- Request count: Total API requests
- Response time: API response times
- Error rate: Failed API calls
- Bandwidth: Data transfer volume
- Subscription usage: Per-subscription metrics

To access APIM analytics:
- Navigate to Azure Portal
- Select APIM instance for your environment
- View Analytics: See usage and performance metrics
- Export reports: Generate usage reports
Best practices
- Set up alerts early: Configure alerts during development
- Monitor key metrics: Focus on business-critical metrics
- Create dashboards: Visualize important metrics
- Review logs regularly: Check logs for issues
- Track error trends: Monitor error rates over time
- Set up SLOs: Define service level objectives
- Document runbooks: Create troubleshooting guides
- Regular reviews: Review metrics and alerts periodically
Troubleshooting guide
Use these steps to diagnose and resolve common issues.

High error rate
- Check logs: Review error messages in Grafana
- Identify pattern: Look for common error types
- Check external services: Verify external API health
- Review recent changes: Check recent deployments
- Scale resources: Increase replicas or resource limits if the workload has outgrown them
Slow performance
- Check response times: Identify slow operations
- Review resource usage: Check CPU and memory
- Analyze traces: Find bottlenecks in request flow
- Check external dependencies: External services may be slow
- Optimize code: Review and optimize slow operations
Memory issues
- Check memory usage: Monitor memory consumption
- Review heap dumps: Analyze memory usage patterns
- Check for leaks: Identify memory leaks
- Increase limits: Adjust memory limits if needed
- Optimize code: Reduce memory footprint
Connection issues
- Check network connectivity: Verify network connections
- Review connection pool: Check pool configuration
- Monitor connection errors: Track connection failures
- Check firewall rules: Verify firewall configurations
- Review DNS: Check DNS resolution
Getting help
- Grafana support: Check Grafana documentation for usage guides
- Team support: Contact your team lead for access issues
- Platform support: Create a support ticket for platform issues