We currently use Splunk as the monitoring solution at my workplace. Any recommendations or suggestions for KPIs that are specific to database monitoring (Oracle and MS SQL Server)?

Sr. Database Administrator in Insurance (except health), 5 months ago

For Oracle and MS SQL Server environments monitored through Splunk, I’d recommend structuring KPIs around four main categories: performance, availability and reliability, resource and capacity, and operational efficiency.

1. Performance KPIs

Query Response Time / Average Execution Time – Tracks slow queries or workloads that may require tuning.
Top N Queries by CPU/IO Consumption – Helps isolate queries that consume disproportionate resources.
Wait Events (Oracle) / Wait Statistics (SQL Server) – Identifies contention points (e.g., buffer cache, locks, I/O).
Transactions per Second (TPS) – Baseline throughput for measuring system health.
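
As a rough illustration of the TPS item above: both engines expose transaction counts as cumulative counters (for example Oracle's "user commits" and "user rollbacks" statistics, or SQL Server's "Transactions/sec" performance counter, which despite its name accumulates), so a rate has to be derived from two samples. The sketch below assumes a collector already delivers timestamped counter values; `CounterSample` and `rate_per_second` are hypothetical names, not part of Splunk or either database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSample:
    """One reading of a cumulative counter (hypothetical collector output)."""
    epoch_seconds: float
    value: int


def rate_per_second(prev: CounterSample, curr: CounterSample) -> Optional[float]:
    """Turn two cumulative readings into a per-second rate.

    Returns None when the interval is not positive or when the counter
    went backwards, which usually means the instance restarted between
    the two samples.
    """
    elapsed = curr.epoch_seconds - prev.epoch_seconds
    if elapsed <= 0:
        return None
    delta = curr.value - prev.value
    if delta < 0:
        return None  # counter reset; skip this interval instead of reporting a negative rate
    return delta / elapsed


if __name__ == "__main__":
    earlier = CounterSample(epoch_seconds=0, value=1_000)
    later = CounterSample(epoch_seconds=60, value=7_000)
    print(rate_per_second(earlier, later))
```

Skipping the interval after a reset keeps a restart from showing up as a false throughput collapse on the dashboard.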

2. Availability & Reliability KPIs

Database Uptime / Connectivity Success Rate – Ensures databases are accessible to applications.
Failed Logins / Authentication Errors – Key for both security and operational availability.
Replication / Log Shipping Lag (SQL Server) and Data Guard Apply Lag (Oracle) – Ensures standby/DR databases are in sync.
Backup & Restore Success Rates – Critical compliance and recovery metric.

3. Resource & Capacity KPIs

CPU and Memory Utilization (per Instance) – With thresholds and anomaly detection.
Buffer Cache Hit Ratio (Oracle) and Page Life Expectancy (SQL Server) – Good indicators of memory efficiency.
Storage Consumption & Growth Rate (Tablespace / Datafiles) – Forecast capacity issues early.
TempDB Usage (SQL Server) / Temporary Tablespace Usage (Oracle) – Detects spikes in sorting and temp operations.
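
Two of the capacity items above reduce to simple arithmetic, sketched here under stated assumptions. The hit ratio uses the traditional Oracle formula, 1 - physical reads / (db block gets + consistent gets), and the forecast fits a straight line to daily usage samples. Both function names are hypothetical, and real growth is rarely perfectly linear, so treat the forecast as a first approximation.

```python
from typing import Optional, Sequence


def buffer_cache_hit_ratio(physical_reads: int, db_block_gets: int,
                           consistent_gets: int) -> Optional[float]:
    """Traditional Oracle ratio: share of logical reads served from memory."""
    logical_reads = db_block_gets + consistent_gets
    if logical_reads <= 0:
        return None
    return 1 - physical_reads / logical_reads


def days_until_full(used_gb_by_day: Sequence[float],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days until capacity is reached from one usage sample per day.

    Fits a least-squares line through the samples; returns None when there
    are too few samples or usage is flat or shrinking.
    """
    n = len(used_gb_by_day)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb_by_day) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb_by_day))
    slope_gb_per_day = sxy / sxx
    if slope_gb_per_day <= 0:
        return None
    remaining_gb = max(capacity_gb - used_gb_by_day[-1], 0.0)
    return remaining_gb / slope_gb_per_day


if __name__ == "__main__":
    print(buffer_cache_hit_ratio(50, 400, 600))
    print(days_until_full([100, 102, 104, 106], capacity_gb=126))
```

Alerting on "days until full" rather than "percent used" tends to scale better across tablespaces and filegroups of very different sizes.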

4. Operational KPIs

Job/ETL Completion Times – Monitors scheduled tasks for overruns.
Blocking Sessions / Deadlocks – Flags when concurrency issues impact applications.
Alert Closure SLA – How quickly critical DB alerts are acknowledged and resolved.
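
For the alert closure SLA, one possible way to express it is the share of closed alerts that were resolved within the agreed window. This is a minimal sketch: the `(opened, resolved)` pair layout and the `sla_compliance` name are assumptions for illustration, not a Splunk data model.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

Alert = Tuple[datetime, Optional[datetime]]  # (opened, resolved or None if still open)


def sla_compliance(alerts: Sequence[Alert], sla: timedelta) -> Optional[float]:
    """Fraction of closed alerts resolved within the SLA window.

    Open alerts are ignored here; returns None when nothing has been closed.
    """
    closed = [(opened, resolved) for opened, resolved in alerts if resolved is not None]
    if not closed:
        return None
    within = sum(1 for opened, resolved in closed if resolved - opened <= sla)
    return within / len(closed)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    sample = [
        (start, start + timedelta(minutes=20)),
        (start, start + timedelta(minutes=45)),
        (start, start + timedelta(hours=3)),
        (start, start + timedelta(minutes=59)),
        (start, None),  # still open, excluded from the ratio
    ]
    print(sla_compliance(sample, sla=timedelta(hours=1)))
```

Excluding open alerts is a design choice; long-running open alerts probably deserve their own panel so they do not hide behind a good compliance figure.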

Splunk dashboards should baseline these KPIs and leverage anomaly detection rather than static thresholds alone. For example, a sudden 40% rise in “average query execution time” compared to the last 30 days can be more meaningful than just crossing a fixed ms threshold.
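
To make the baseline idea concrete, here is a minimal sketch of a relative-rise check against a trailing window. It assumes one average-execution-time value per day for the previous 30 days; the function name and the 40% default are illustrative, and a production setup would more likely express this in a Splunk search or use its built-in anomaly detection.

```python
from statistics import mean
from typing import Sequence


def exceeds_baseline(history_ms: Sequence[float], current_ms: float,
                     max_rise: float = 0.40) -> bool:
    """Flag the current value when it is at least `max_rise` above the
    mean of the trailing history (for example 30 daily averages).

    Returns False when there is no usable baseline to compare against.
    """
    if not history_ms:
        return False
    baseline = mean(history_ms)
    if baseline <= 0:
        return False
    relative_rise = (current_ms - baseline) / baseline
    return relative_rise >= max_rise


if __name__ == "__main__":
    last_30_days = [100.0] * 30  # hypothetical daily averages in ms
    print(exceeds_baseline(last_30_days, 145.0))
    print(exceeds_baseline(last_30_days, 130.0))
```

A mean is easy to reason about but sensitive to outliers in the history; a median or a per-weekday baseline may suit workloads with strong weekly patterns better.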
