# Configure AutoScaling for AI Workloads
## Step 1. Enable AutoScaling

### Add Pod AutoScaling Annotations

Use these annotations in conjunction with the workload annotations described in Create Workload:
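As a sketch of where the annotations go, they sit under the Pod's `metadata.annotations` (the Pod name, container, and image below are illustrative, not from the Create Workload guide):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-demo        # illustrative name
  annotations:
    tensor-fusion.ai/auto-resources: 'true'
    tensor-fusion.ai/auto-scale-target-resource: all
    tensor-fusion.ai/auto-replicas: 'true'
spec:
  containers:
    - name: app
      image: my-inference:latest  # illustrative image
```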
```yaml
# Enable vertical scaling
tensor-fusion.ai/auto-resources: 'true'
# Target resource to scale, options: all|tflops|vram; if empty, only recommendations are provided
tensor-fusion.ai/auto-scale-target-resource: all
# Enable horizontal scaling
tensor-fusion.ai/auto-replicas: 'true'
```

### Detailed Configuration Using Workload Configuration File
- Vertical Scaling: Based on historical GPU resource usage data, the community VPA histogram algorithm is employed. The estimates it produces consist of Target, LowerBound, and UpperBound, corresponding by default to the P90, P50, and P95 usage levels. If current resource usage falls outside the [LowerBound, UpperBound] range, a new recommended value is generated.
Note: If `enable` is not set to `true`, or if `targetResource` is empty, only resource recommendations are generated; the recommended values are not actually applied.
- Cron Scaling: Based on standard cron expressions, scaling takes effect when `enable` is `true` and the current time is within the `start`-to-`end` range. Outside this range, resources revert to the values specified when the workload was added. See the Cron Expression Reference.
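The vertical-scaling bound check above can be sketched in a few lines. This is an illustrative approximation, not TensorFusion's actual implementation; the nearest-rank percentile and the defaults (target = P90 plus a 15% margin, lower bound = P50, upper bound = P95) mirror the documented behavior:

```python
# Illustrative sketch of the VPA-style histogram estimate described above.

def percentile(samples, p):
    """Nearest-rank percentile of a list of usage samples (0 <= p <= 1)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p * (len(s) - 1)))))
    return s[k]

def recommend(samples, target_p=0.9, lower_p=0.5, upper_p=0.95, margin=0.15):
    """Return (lower_bound, target, upper_bound) for one resource dimension."""
    lower = percentile(samples, lower_p)
    upper = percentile(samples, upper_p)
    # margin corresponds to requestMarginFraction below
    target = percentile(samples, target_p) * (1 + margin)
    return lower, target, upper

def needs_rescale(current_usage, lower, upper):
    """A recommendation is applied only when usage leaves the estimated bound."""
    return current_usage < lower or current_usage > upper
```

With TFLOPS samples 1..100, for example, this yields a lower bound of 51, an upper bound of 95, and a target of about 103.5, so a current usage of 40 would fall below the lower bound and trigger a recommendation.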
```yaml
autoScalingConfig:
  # Vertical scaling configuration
  autoSetResources:
    # Enable/disable
    enable: true
    # Target resource, options: all|tflops|vram
    targetResource: all
    # TFLOPS usage percentile used as the base for the TFLOPS target recommendation. Default: 0.9
    targetTflopsPercentile: 0.9
    # TFLOPS usage percentile used for the lower bound of the TFLOPS recommendation. Default: 0.5
    lowerBoundTflopsPercentile: 0.5
    # TFLOPS usage percentile used for the upper bound of the TFLOPS recommendation. Default: 0.95
    upperBoundTflopsPercentile: 0.95
    # VRAM usage percentile used as the base for the VRAM target recommendation. Default: 0.9
    targetVramPercentile: 0.9
    # VRAM usage percentile used for the lower bound of the VRAM recommendation. Default: 0.5
    lowerBoundVramPercentile: 0.5
    # VRAM usage percentile used for the upper bound of the VRAM recommendation. Default: 0.95
    upperBoundVramPercentile: 0.95
    # Fraction of usage added as a safety margin to the recommended request. Default: 0.15
    requestMarginFraction: 0.15
    # Time interval used for computing the confidence multiplier for the lower and upper bound. Default: 24h
    confidenceInterval: 24h
  # Horizontal scaling configuration
  autoSetReplicas: {}
  # Cron-based scaling configuration
  cronScalingRules:
    # Enable/disable this rule
    - enable: true
      # Rule name
      name: "test"
      # Rule start time (cron expression)
      start: "0 0 * * Thu"
      # Rule end time (cron expression)
      end: "59 23 * * Thu"
      # Desired GPU resources
      desiredResources:
        limits:
          tflops: "99"
          vram: 10Gi
        requests:
          tflops: "44"
          vram: 5Gi
```

## Step 2. Monitor Scaling Status
The workload generates a corresponding `TensorFusionWorkload` resource object, and the fields in `Status` reflect the current scaling status in real time.
```yaml
status:
  conditions:
    # Reason for the GPU resource recommendation
    - lastTransitionTime: '2025-10-09T09:16:46Z'
      message: TFLOPS scaled up due to (1) below lower bound (2)
      reason: OutOfEstimatedBound
      status: 'True'
      type: RecommendationProvided
  # Current GPU resource recommendations
  recommendation:
    limits:
      tflops: '13'
      vram: 1Gi
    requests:
      tflops: '13'
      vram: 1Gi
  # Number of replicas with applied GPU resource recommendations
  appliedRecommendedReplicas: 3
  # Currently active cron scaling rule
  activeCronScalingRule: <...>
```
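As a sketch of consuming this status, the snippet below checks the `RecommendationProvided` condition before reading the recommended requests. The status is mirrored here as a plain dict; a real controller or script would fetch the `TensorFusionWorkload` object via the Kubernetes API instead:

```python
# Status mirrored from the YAML above as a plain dict (illustrative).
status = {
    "conditions": [
        {
            "type": "RecommendationProvided",
            "status": "True",
            "reason": "OutOfEstimatedBound",
        }
    ],
    "recommendation": {
        "limits": {"tflops": "13", "vram": "1Gi"},
        "requests": {"tflops": "13", "vram": "1Gi"},
    },
    "appliedRecommendedReplicas": 3,
}

def recommendation_provided(status):
    """True when a RecommendationProvided condition reports status 'True'."""
    return any(
        c.get("type") == "RecommendationProvided" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )

if recommendation_provided(status):
    req = status["recommendation"]["requests"]
    # prints: recommended requests: 13 TFLOPS, 1Gi VRAM
    print(f"recommended requests: {req['tflops']} TFLOPS, {req['vram']} VRAM")
```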