Description
Steps to reproduce
Configs:
# my_cpu_fleet.yml
type: fleet
name: cpu-default
nodes: 0..8
resources:
  cpu: 2
# simple-service-replicas.yml
type: service
name: simple-service-replicas
https: false
python: 3.12
commands:
  - echo "Group default - Version 1" > /tmp/version.txt
  - python3 -m http.server 8000
port: 8000
resources:
  cpu: 2
replicas: 5
Step 1: Create the fleet: dstack apply -f my_cpu_fleet.yml
Step 2: Apply the service config: dstack apply -f simple-service-replicas.yml
The first run works as expected:
dstack ps
NAME BACKEND GPU PRICE STATUS SUBMITTED
simple-service-replicas - - running 5 mins ago
replica=0 aws (us-east-2) - $0.0832 running 5 mins ago
replica=1 aws (us-east-2) - $0.0832 running 5 mins ago
replica=2 aws (us-east-2) - $0.0832 running 5 mins ago
replica=3 aws (us-east-2) - $0.0832 running 5 mins ago
replica=4 aws (us-east-2) - $0.0832 running 5 mins ago
dstack fleet
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
default - - - - - 4 days ago
cpu-default 0 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 4 mins ago
1 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 4 mins ago
2 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 4 mins ago
3 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 4 mins ago
4 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 4 mins ago
Step 3: Stop the run. The fleet instances become idle, as expected:
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
default - - - - - 4 days ago
cpu-default 0 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 6 mins ago
1 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 5 mins ago
2 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 5 mins ago
3 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 5 mins ago
4 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 5 mins ago
Step 4: Apply the service config again: dstack apply -f simple-service-replicas.yml
dstack ps
NAME BACKEND GPU PRICE STATUS SUBMITTED
simple-service-replicas - - running 56 sec ago
replica=0 aws (us-east-2) - $0.0832 running 55 sec ago
replica=1 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=2 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=3 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=4 aws (us-east-2) - $0.0832 running 55 sec ago
All five fleet instances are expected to be busy while the replicas are pulling or running, but three of them are still idle:
dstack fleet
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
default - - - - - 4 days ago
cpu-default 0 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
1 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
2 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
3 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
4 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
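The mismatch can be checked mechanically from the pasted output. The following is a minimal sketch that assumes the whitespace-separated layout shown in this report; the helper `count_statuses` and the status sets are hypothetical and not part of dstack.

```python
from collections import Counter

# Status words as they appear in the outputs pasted in this report.
FLEET_STATUSES = {"busy", "idle"}
REPLICA_STATUSES = {"running", "pulling", "terminating"}


def count_statuses(lines, statuses, row_filter):
    """Count the first status word on each row accepted by row_filter."""
    counts = Counter()
    for line in lines:
        if not row_filter(line):
            continue
        for token in line.split():
            if token in statuses:
                counts[token] += 1
                break
    return counts


# Step 4 output of `dstack ps`, copied from the report.
ps_output = """\
simple-service-replicas - - running 56 sec ago
replica=0 aws (us-east-2) - $0.0832 running 55 sec ago
replica=1 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=2 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=3 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=4 aws (us-east-2) - $0.0832 running 55 sec ago
""".splitlines()

# Step 4 output of `dstack fleet`, copied from the report.
fleet_output = """\
default - - - - - 4 days ago
cpu-default 0 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
1 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
2 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
3 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
4 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
""".splitlines()

# Only per-replica rows and rows describing a provisioned instance are counted.
replica_counts = count_statuses(
    ps_output, REPLICA_STATUSES, lambda line: line.startswith("replica=")
)
fleet_counts = count_statuses(
    fleet_output, FLEET_STATUSES, lambda line: "cpu=" in line
)

active_replicas = replica_counts["running"] + replica_counts["pulling"]
unaccounted = active_replicas - fleet_counts["busy"]

print(f"active replicas: {active_replicas}, busy instances: {fleet_counts['busy']}")
print(f"replicas without a busy instance: {unaccounted}")
```

With the Step 4 data, five replicas are active while only two instances are busy, which matches the three replicas stuck in pulling.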
Step 5: Check the run again after a while:
dstack ps
NAME BACKEND GPU PRICE STATUS SUBMITTED
simple-service-replicas - - terminating 3 mins ago
replica=0 aws (us-east-2) - $0.0832 running 3 mins ago
replica=1 aws (us-east-2) - $0.0832 terminating 3 mins ago
replica=2 aws (us-east-2) - $0.0832 terminating 3 mins ago
replica=3 aws (us-east-2) - $0.0832 terminating 3 mins ago
replica=4 aws (us-east-2) - $0.0832 running 3 mins ago
The run gets terminated.
Actual behaviour
The run is terminated on re-run even though the fleet has idle instances.
dstack fleet
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
default - - - - - 4 days ago
cpu-default 0 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
1 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
2 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
3 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
4 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
dstack ps
NAME BACKEND GPU PRICE STATUS SUBMITTED
simple-service-replicas - - terminating 3 mins ago
replica=0 aws (us-east-2) - $0.0832 running 3 mins ago
replica=1 aws (us-east-2) - $0.0832 terminating 3 mins ago
replica=2 aws (us-east-2) - $0.0832 terminating 3 mins ago
replica=3 aws (us-east-2) - $0.0832 terminating 3 mins ago
replica=4 aws (us-east-2) - $0.0832 running 3 mins ago
Expected behaviour
The re-run should not be terminated and idle fleet instances should be utilized.
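As an illustration of the expected outcome only, here is a toy placement model. It is not dstack's scheduler; `Instance` and `place_replicas` are hypothetical names, and the sketch assumes that a replica fits on any idle instance with at least the requested number of CPUs.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    num: int
    cpu: int
    status: str  # "idle" or "busy"


def place_replicas(instances, replicas, cpu_per_replica):
    """Toy placement: give each replica the first idle instance with enough CPUs."""
    placement = {}
    for replica in range(replicas):
        for inst in instances:
            if inst.status == "idle" and inst.cpu >= cpu_per_replica:
                inst.status = "busy"  # the instance is now taken
                placement[replica] = inst.num
                break
    return placement


# State after Step 3: five idle instances with cpu=2, as in the report.
fleet = [Instance(num=i, cpu=2, status="idle") for i in range(5)]

# The service config asks for replicas: 5 with cpu: 2 each.
placement = place_replicas(fleet, replicas=5, cpu_per_replica=2)
unplaced = 5 - len(placement)

print(placement)
print(f"unplaced replicas: {unplaced}")
```

Under these assumptions every replica gets its own instance and none is left unplaced, which is the behaviour this issue expects from the re-run.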
dstack version
master (commit: b2be6a7)