Overview
The platform provides comprehensive monitoring capabilities through:
- Grafana - Dashboards, metrics visualization, and alerting
- Kubernetes - Container and cluster-level monitoring
- Azure APIM - API analytics and usage metrics
Using monitoring tools
Use the monitoring tools to view logs, metrics, and traces for your connectors.

Grafana
Use Grafana for dashboards and metrics visualization.

Prerequisites
To use Grafana, you need:
- A Microsoft Entra ID account (formerly Azure AD)
- Grafana roles assigned in self-service.tfvars:
  - {env}-ro: Read-only access to logs and dashboards
  - {env}-rw: Full access, including creating dashboards, monitors, and alerts
Opening Grafana
- Navigate to https://ecos{customername}.grafana.net
  - Example: https://ecosecos.grafana.net (where ecos is the customer name)
- Sign in with your Microsoft Entra ID credentials
- Select your workspace/environment
Grafana data sources
Grafana is pre-configured with data sources named after your customer. Select the appropriate data source based on what you want to view:

| Data source | Purpose | Example (for customer “ecos”) |
|---|---|---|
| {customer}-logs | Application and container logs | ecos-logs |
| {customer}-metrics | Prometheus metrics (CPU, memory, request rates) | ecos-metrics |
| {customer}-traces | Distributed traces (request flows) | ecos-traces |
| {customer}-profiles | Continuous profiling data | ecos-profiles |
Query modes
Grafana offers multiple ways to query your data:
- Builder mode: Visual query builder with dropdowns and filters - ideal for beginners
- Code mode: Write raw PromQL (metrics) or LogQL (logs) queries - for advanced users
- Drilldown view: Click on any metric, log, or trace to explore related data and navigate between signals
Monitoring connector health
Monitor your connector’s health using logs, metrics, and traces.

Application logs
Viewing logs in Grafana
To view application logs in Grafana:
- Navigate to Explore in the left sidebar
- Select the logs data source: Choose {customer}-logs (for example, ecos-logs)
- Choose your query mode:
  - Builder: Use the visual label browser to filter by namespace, app, or container
  - Code: Write LogQL queries directly
- Use Drilldown view: Click on any log line to see details and related traces
Example LogQL queries:
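The following are illustrative LogQL queries for Code mode; the namespace and app label values shown are placeholders, so substitute your own connector's labels:

```logql
# All logs for a connector in a given namespace
{namespace="my-namespace", app="my-connector"}

# Only log lines containing ERROR
{namespace="my-namespace", app="my-connector"} |= "ERROR"

# Per-second rate of ERROR lines over the last 5 minutes
sum(rate({namespace="my-namespace", app="my-connector"} |= "ERROR" [5m]))
```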
Log levels
Monitor different log levels:
- ERROR: Critical issues requiring immediate attention
- WARN: Potential issues or unusual conditions
- INFO: General operational information
- DEBUG: Detailed diagnostic information
Application metrics
Track key performance and resource metrics for your connector.

Key metrics to monitor
Performance metrics:
- Request rate: Requests per second
- Response time: P50, P95, P99 latencies
- Error rate: Percentage of failed requests
- Throughput: Messages processed per second

Resource metrics:
- CPU utilization: Current CPU usage
- Memory usage: Memory consumption
- Network I/O: Network traffic
- Disk I/O: Disk read/write operations

Business metrics:
- Message processing rate: Messages processed successfully
- API call success rate: External API call success percentage
- Queue depth: Number of pending messages
- Processing time: Time to process each message
Viewing metrics in Grafana
To view and analyze metrics:
- Navigate to Explore in the left sidebar
- Select the metrics data source: Choose {customer}-metrics (for example, ecos-metrics)
- Choose your query mode:
  - Builder: Use the visual metric browser to select metrics and apply filters
  - Code: Write PromQL queries directly for advanced queries
- Use Drilldown view: Click on any metric to explore related logs and traces
- Browse Dashboards: Navigate to Dashboards to view pre-built connector dashboards
Example PromQL queries:
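A few illustrative PromQL queries for Code mode; the metric names follow common Prometheus and cAdvisor conventions and may differ for your connector:

```promql
# Request rate (requests per second over the last 5 minutes)
sum(rate(http_requests_total{namespace="my-namespace"}[5m]))

# P95 response time in seconds
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace="my-namespace"}[5m])) by (le))

# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])) by (pod)

# Memory usage per pod
sum(container_memory_working_set_bytes{namespace="my-namespace"}) by (pod)
```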
Distributed tracing
Trace requests as they flow through your connector and external services.

Viewing traces in Grafana
To view distributed traces:
- Navigate to Explore in the left sidebar
- Select the traces data source: Choose {customer}-traces (for example, ecos-traces)
- Search for traces:
  - Builder: Use filters for service name, operation, duration, or status
  - Code: Write TraceQL queries for advanced filtering
- Use Drilldown view: Click on any trace to see the full request flow and related logs
- Correlate with logs: From a trace span, click to view associated log entries
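For Code mode, a couple of illustrative TraceQL queries; the service name is a placeholder for your connector's configured service name:

```traceql
# Traces from a service that took longer than 500 ms
{ resource.service.name = "my-connector" && duration > 500ms }

# Traces from that service containing an error status
{ resource.service.name = "my-connector" && status = error }
```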
Request tracing
Trace requests across services:
- View request flow: See how requests traverse through services
- Identify bottlenecks: Find slow operations
- Error tracking: See where errors occur in the flow
Health checks
Verify your connector is running and ready to handle requests.

Kubernetes health checks
Monitor connector health at the Kubernetes level:

Application health endpoints
Monitor app-level health:
- Live endpoint at /health/live: Indicates whether the app process is running. Kubernetes uses this to decide whether to restart the container.
- Ready endpoint at /health/ready: Indicates whether the app can accept traffic. Kubernetes uses this to decide whether to route requests to the pod.
- Metrics endpoint at /metrics: Exposes app metrics in Prometheus format for monitoring and alerting.
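The live and ready endpoints above map onto Kubernetes liveness and readiness probes. A minimal sketch of the container spec fragment, where the port and timing values are illustrative assumptions:

```yaml
# Illustrative probe configuration for a connector container
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
```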
Error tracking
Track and analyze errors to identify and resolve issues.

Error monitoring
Common error types:
- Connection errors: External API connectivity issues
- Timeout errors: Slow or unresponsive services
- Authentication errors: Invalid credentials or tokens
- Validation errors: Invalid data formats
- Resource errors: Memory or CPU exhaustion
Alerting
Configure alerts in Grafana to receive notifications when issues occur.

Setting up alerts in Grafana
To create an alert rule in Grafana:
- Navigate to Alerting > Alert rules in the left sidebar
- Click “New alert rule” to create a new rule
- Define the query:
- Select your data source (Prometheus, Loki, etc.)
- Write the query for the metric you want to monitor
- Set the threshold condition (for example, > 0.05 for a 5% error rate)
- Set evaluation behavior:
- Choose or create a folder for the rule
- Set the evaluation interval (for example, every 1 minute)
- Set the pending period before firing (for example, 5 minutes)
- Add labels and annotations:
- Add labels to categorize alerts
- Add annotations for alert details and runbook links
- Configure notifications:
- Link to a contact point (email, Slack, PagerDuty, etc.)
- Set notification policy if needed
- Save the alert rule
Example alert conditions
| Metric | Condition | Duration | Severity |
|---|---|---|---|
| Error rate | > 5% | 5 minutes | Critical |
| Response time P95 | > 1000ms | 10 minutes | Warning |
| CPU usage | > 80% | 5 minutes | Warning |
| Memory usage | > 90% | 5 minutes | Critical |
Example PromQL queries for alerts
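As a sketch of the conditions in the table above, assuming conventional Prometheus and cAdvisor metric names (placeholders; adjust to the metrics your connector actually exposes):

```promql
# Error rate above 5% (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.05

# P95 response time above 1000 ms
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1

# Memory usage above 90% of the container limit
container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
```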
Dashboards
View pre-built dashboards or create custom dashboards to visualize key metrics and monitor connector health.

Dashboard folders
Grafana dashboards are organized into two main folders:

| Folder | Purpose | Example dashboards |
|---|---|---|
| ecos | Platform infrastructure and components | Kubernetes cluster metrics, ArgoCD deployments, Knative services, Istio service mesh |
| custom | Application-level dashboards for your connectors | Request counts, response times (P50, P90, P95, P99), error rates, throughput |
To view a dashboard:
- Navigate to Dashboards in the left sidebar
- Select a folder (ecos or custom)
- Click on a dashboard to view it
Custom folder dashboards
The custom folder contains application dashboards that show:
- Request metrics: Total requests, requests per second
- Latency percentiles: P50, P90, P95, P99 response times
- Error rates: 4xx and 5xx error percentages
- Throughput: Messages processed per second
Creating custom dashboards
Recommended dashboard panels:
- Request rate over time
- Error rate percentage
- Response time percentiles
- CPU and memory usage
- Active connections
- Queue depth
- Recent errors log
Creating dashboards in Grafana
- Navigate to Dashboards > New > New Dashboard
- Add a visualization and select your panel type:
- Time series: For metrics over time
- Stat: For single value displays
- Logs: For log panels
- Table: For tabular data
- Configure the query:
- Select the data source (Prometheus for metrics, Loki for logs)
- Write your PromQL or LogQL query
- Customize the panel:
- Set title and description
- Configure thresholds and colors
- Adjust axes and legends
- Save the dashboard with a descriptive name
Performance analysis
Analyze performance data to identify and resolve bottlenecks.

Identifying performance issues
Slow response times:
- Check external API latency: External services may be slow
- Review database queries: Optimize slow queries
- Analyze thread pool usage: May need more threads
- Check resource constraints: CPU or memory limits

High error rates:
- Review error logs: Identify error patterns
- Check external service health: External APIs may be down
- Validate input data: Ensure data format is correct
- Review authentication: Check token expiration

Resource exhaustion:
- Monitor memory usage: Check for memory leaks
- Review CPU usage: Optimize CPU-intensive operations
- Check connection pools: May need to increase pool size
- Review thread usage: Optimize thread pool configuration
Log archives
View archived logs for long-term analysis and compliance.

Viewing log archives
For long-term log storage and analysis:

Prerequisites:
- Azure role: Log archive access role assigned
- Storage account access: Azure Storage Account credentials
- Key Vault access: For SPN credentials (automation)

Access methods:
- Azure Portal: Browse storage account
- Azure Storage Explorer: Desktop application
- Azure CLI: Command-line interface
- AzCopy: High-performance copying

Use cases:
- Compliance: Long-term log retention
- Forensics: Investigating historical issues
- Analytics: Analyzing trends over time
- Auditing: Security and compliance audits
API management monitoring
Monitor API usage and performance through Azure API Management.

APIM analytics
Monitor API usage through Azure APIM.

Available metrics:
- Request count: Total API requests
- Response time: API response times
- Error rate: Failed API calls
- Bandwidth: Data transfer volume
- Subscription usage: Per-subscription metrics

To access APIM analytics:
- Navigate to Azure Portal
- Select APIM instance for your environment
- View Analytics: See usage and performance metrics
- Export reports: Generate usage reports
Best practices
- Set up alerts early: Configure alerts during development
- Monitor key metrics: Focus on business-critical metrics
- Create dashboards: Visualize important metrics
- Review logs regularly: Check logs for issues
- Track error trends: Monitor error rates over time
- Set up SLOs: Define service level objectives
- Document runbooks: Create troubleshooting guides
- Regular reviews: Review metrics and alerts periodically
Troubleshooting guide
Use these steps to diagnose and resolve common issues.

High error rate
- Check logs: Review error messages in Grafana
- Identify pattern: Look for common error types
- Check external services: Verify external API health
- Review recent changes: Check recent deployments
- Scale resources: Increase replicas or resource limits if the workload has outgrown them
Slow performance
- Check response times: Identify slow operations
- Review resource usage: Check CPU and memory
- Analyze traces: Find bottlenecks in request flow
- Check external dependencies: External services may be slow
- Optimize code: Review and optimize slow operations
Memory issues
- Check memory usage: Monitor memory consumption
- Review heap dumps: Analyze memory usage patterns
- Check for leaks: Identify memory leaks
- Increase limits: Adjust memory limits if needed
- Optimize code: Reduce memory footprint
Connection issues
- Check network connectivity: Verify network connections
- Review connection pool: Check pool configuration
- Monitor connection errors: Track connection failures
- Check firewall rules: Verify firewall configurations
- Review DNS: Check DNS resolution
Getting help
- Grafana support: Check Grafana documentation for usage guides
- Team support: Contact your team lead for access issues
- Platform support: Create a support ticket for platform issues