Horizons: The OmniChat

A flexible and powerful chatbot platform that brings enterprise-grade LLM capabilities to your infrastructure.

View project on GitHub

Monitoring and Observability

Effective monitoring is crucial for maintaining a healthy and responsive Horizons OmniChat deployment. This guide explains our comprehensive approach to monitoring, helping you understand not just what to monitor, but why and how to respond to different scenarios.

Understanding Horizons Monitoring

Monitoring in Horizons isn’t just about watching metrics - it’s about gaining insights into your deployment’s health, performance, and security. Our monitoring strategy covers multiple layers of the application stack, ensuring you have complete visibility into your system’s operation.

System Health Monitoring

Your first line of defense is understanding the basic health of your system components. We implement comprehensive health checks across all services:

Component Health Checks

The foundation of our monitoring starts with basic service health:

# Check WebUI health
curl http://localhost:3002/health

# Verify Ollama status
curl http://localhost:11434/api/tags

# Monitor database connectivity
docker exec webui-db pg_isready

These checks provide immediate insight into service availability and basic functionality.

Performance Monitoring

Understanding performance goes beyond simple up/down status. We track several key metrics that indicate the overall health and efficiency of your deployment:

Key Performance Indicators (KPIs)

  1. Response Times
    • Model inference latency
    • API request duration
    • Database query performance
    • Network latency between components
  2. Resource Utilization
    • CPU usage patterns
    • Memory consumption
    • Storage utilization
    • Network bandwidth
  3. Application Metrics (ENTERPRISE)
    • Active user sessions
    • Request rates
    • Model usage statistics
    • Error rates

Proactive Monitoring (ENTERPRISE)

Prevention is better than cure. Our proactive monitoring approach helps you identify potential issues before they impact your users:

Early Warning Systems

We implement various early warning mechanisms:

  1. Resource Trending
    • Monitor resource usage patterns
    • Identify growing trends
    • Predict capacity needs
    • Alert on approaching thresholds
  2. Performance Degradation Detection
    • Baseline performance metrics
    • Track deviation from normal
    • Identify slow degradation
    • Alert on significant changes

Deployment-Specific Monitoring

Different deployment modes have unique monitoring needs. Here’s how we address each:

Local Mode Monitoring

For local deployments, we focus on container and resource monitoring:

# Monitor container resources
docker stats

# Check system resources
top
htop  # for interactive view
nvidia-smi  # for GPU monitoring

AWS Mode Monitoring

In AWS deployments, we leverage cloud-native monitoring tools:

  1. CloudWatch Integration
    • Custom metrics
    • Log aggregation
    • Alarm configuration (ENTERPRISE)
    • Dashboard creation (ENTERPRISE)
  2. Infrastructure Monitoring
    • ECS service health
    • RDS performance
    • Network metrics
    • Auto-scaling events

Alert Management (ENTERPRISE)

Our alert management strategy ensures the right people get the right information at the right time.

Alert Configuration

We implement a tiered alert system with the following:

  1. Critical Alerts
    • Service outages
    • Security incidents
    • Data loss risks
    • Performance emergencies
  2. Warning Alerts
    • Resource thresholds
    • Performance degradation
    • Error rate increases
    • Capacity warnings
  3. Information Alerts
    • Routine maintenance
    • System updates
    • Usage statistics
    • Trend reports

Next Steps

To implement comprehensive monitoring in your deployment:

  1. Review the Security Monitoring guide
  2. Configure Backup Monitoring
  3. Set up Performance Alerts
  4. Establish Incident Response

Horizons OmniChat by evereven