© 2026 Gartner, Inc. and/or its affiliates. All rights reserved.
For Oracle and MS SQL Server environments monitored through Splunk, I’d recommend structuring KPIs around four main categories: performance, availability and reliability, resource and capacity, and operational efficiency.
1. Performance KPIs
Query Response Time / Average Execution Time – Tracks slow queries or workloads that may require tuning.
Top N Queries by CPU/IO Consumption – Helps isolate queries that consume disproportionate resources.
Wait Events (Oracle) / Wait Statistics (SQL Server) – Identifies contention points (e.g., buffer cache, locks, I/O).
Transactions per Second (TPS) – Baseline throughput for measuring system health.
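To illustrate the TPS item: transaction counts are typically exposed as cumulative counters, so the rate has to be derived from two samples taken some time apart. Here is a minimal Python sketch of that calculation. The `CounterSample` record and its field names are hypothetical, chosen only for illustration, and are not anything Splunk or either database defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterSample:
    """One reading of a cumulative transaction counter (hypothetical shape)."""
    epoch_seconds: float
    transactions_total: int


def transactions_per_second(prev: CounterSample, curr: CounterSample) -> Optional[float]:
    """Derive TPS from two cumulative counter samples.

    Returns None when the samples cannot be compared, for example when the
    counter went backwards (instance restart) or the timestamps are not
    strictly increasing.
    """
    elapsed = curr.epoch_seconds - prev.epoch_seconds
    delta = curr.transactions_total - prev.transactions_total
    if elapsed <= 0 or delta < 0:
        return None
    return delta / elapsed
```

Returning None on a counter reset keeps a restart from showing up as a large negative or bogus spike in the baseline.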
2. Availability & Reliability KPIs
Database Uptime / Connectivity Success Rate – Ensures databases are accessible to applications.
Failed Logins / Authentication Errors – Key for both security and operational availability.
Replication / Log Shipping Lag (SQL Server) and Data Guard Apply Lag (Oracle) – Ensures standby/DR databases are in sync.
Backup & Restore Success Rates – Critical compliance and recovery metric.
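The success-rate KPIs in this category (connectivity, backup and restore) all reduce to the same ratio of successful attempts to total attempts. Here is a minimal Python sketch, assuming each probe or job run has already been reduced to a boolean outcome. The function name is my own.

```python
from typing import Optional, Sequence


def success_rate(outcomes: Sequence[bool]) -> Optional[float]:
    """Return the percentage of successful outcomes, or None if there are none.

    Each element is True for a successful connection probe or backup run
    and False for a failed one.
    """
    if not outcomes:
        return None  # no attempts in the window; avoid reporting a misleading 0% or 100%
    successes = sum(1 for ok in outcomes if ok)
    return 100.0 * successes / len(outcomes)
```

Treating an empty window as "no data" rather than 100% matters here, because a monitoring gap would otherwise look like perfect availability.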
3. Resource & Capacity KPIs
CPU and Memory Utilization (per Instance) – Tracked against both thresholds and anomaly detection.
Buffer Cache Hit Ratio (Oracle) and Page Life Expectancy (SQL Server) – Good indicators of memory efficiency.
Storage Consumption & Growth Rate (Oracle Tablespaces and Datafiles / SQL Server Data and Log Files) – Forecast capacity issues early.
TempDB Usage (SQL Server) / Temporary Tablespace Usage (Oracle) – Detects spikes in sorting and temp operations.
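For the storage growth KPI, a simple way to forecast capacity issues is to project the recent average daily growth forward. Here is a rough Python sketch, assuming one usage sample per day in GB. The function and parameter names are illustrative, and a real forecast would likely want a proper trend fit rather than this two-point average.

```python
from typing import Optional, Sequence


def days_until_full(daily_used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until capacity is reached from evenly spaced daily samples.

    Uses the average daily growth between the first and last sample.
    Returns None if there are fewer than two samples or usage is not growing.
    """
    if len(daily_used_gb) < 2:
        return None
    days_observed = len(daily_used_gb) - 1
    growth_per_day = (daily_used_gb[-1] - daily_used_gb[0]) / days_observed
    if growth_per_day <= 0:
        return None  # flat or shrinking usage: no projected exhaustion
    return (capacity_gb - daily_used_gb[-1]) / growth_per_day
```

For example, samples of 100, 110, 120, and 130 GB against a 200 GB limit give 10 GB/day of growth and roughly 7 days of headroom.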
4. Operational KPIs
Job/ETL Completion Times – Monitors scheduled tasks for overruns.
Blocking Sessions / Deadlocks – Flags when concurrency issues impact applications.
Alert Closure SLA – How quickly critical DB alerts are acknowledged and resolved.
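The alert closure SLA can be expressed as the share of alerts resolved within a target window. Here is a minimal Python sketch, assuming each alert is available as an (opened, resolved) timestamp pair, with `None` for alerts that are still open. This input shape is an assumption on my part, not a Splunk data model.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

Alert = Tuple[datetime, Optional[datetime]]  # (opened, resolved or None if still open)


def sla_compliance(alerts: Sequence[Alert], target: timedelta) -> Optional[float]:
    """Return the percentage of alerts resolved within the target duration.

    Alerts that are still open count as not meeting the SLA.
    Returns None when there are no alerts to evaluate.
    """
    if not alerts:
        return None
    met = sum(
        1
        for opened, resolved in alerts
        if resolved is not None and resolved - opened <= target
    )
    return 100.0 * met / len(alerts)
```

Counting still-open alerts as misses is a deliberate choice in this sketch. You may prefer to exclude alerts that are still inside their SLA window.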
Splunk dashboards should baseline these KPIs and use anomaly detection rather than relying on static thresholds alone. For example, a sudden 40% rise in “average query execution time” relative to its trailing 30-day baseline can be more meaningful than crossing a fixed millisecond threshold.
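The 40% example can be sketched as a relative-change check against a baseline mean. This is an illustration in Python rather than SPL, and the function names and the default 0.40 cutoff are my own assumptions taken from the example. In practice you would tune the cutoff per KPI.

```python
from statistics import mean
from typing import Optional, Sequence


def relative_rise(baseline_values: Sequence[float], current_value: float) -> Optional[float]:
    """Return the fractional change of current_value versus the baseline mean.

    0.40 means a 40% rise. Returns None if the baseline is empty or its
    mean is not positive, since the ratio would be undefined or misleading.
    """
    if not baseline_values:
        return None
    baseline = mean(baseline_values)
    if baseline <= 0:
        return None
    return (current_value - baseline) / baseline


def is_anomalous(baseline_values: Sequence[float], current_value: float,
                 max_rise: float = 0.40) -> bool:
    """Flag the current value when it exceeds the baseline mean by max_rise or more."""
    rise = relative_rise(baseline_values, current_value)
    return rise is not None and rise >= max_rise
```

With 30 daily averages of 100 ms as the baseline, a current value of 150 ms is flagged (a 50% rise) while 120 ms is not, even though both might sit below a static threshold such as 200 ms.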