[Bug]: Service re-run terminates despite available fleet capacity. #3403

@Bihan

Description

Steps to reproduce

Configs:

# my_cpu_fleet.yml
type: fleet
name: cpu-default

nodes: 0..8

resources:
  cpu: 2

# simple-service-replicas.yml
type: service
name: simple-service-replicas
https: false
python: 3.12


commands:
  - echo "Group default - Version 1" > /tmp/version.txt
  - python3 -m http.server 8000

port: 8000

resources:
  cpu: 2

replicas: 5

Step1: Create the fleet: dstack apply -f my_cpu_fleet.yml

Step2: Apply the service config: dstack apply -f simple-service-replicas.yml

The first run works as expected:

dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED  
 simple-service-replicas                   -    -        running  5 mins ago 
   replica=0              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=1              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=2              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=3              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=4              aws (us-east-2)  -    $0.0832  running  5 mins ago 

dstack fleet
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED    
 default      -         -                -                         -        -       4 days ago 
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 

Step3: Stop the run. The fleet instances become idle, as expected:

FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED    
default      -         -                -                         -        -       4 days ago 
cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    6 mins ago 
             1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 
             2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 
             3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 
             4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 

Step4: Apply the service config again: dstack apply -f simple-service-replicas.yml

dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED  
 simple-service-replicas                   -    -        running  56 sec ago 
   replica=0              aws (us-east-2)  -    $0.0832  running  55 sec ago 
   replica=1              aws (us-east-2)  -    $0.0832  pulling  55 sec ago 
   replica=2              aws (us-east-2)  -    $0.0832  pulling  55 sec ago 
   replica=3              aws (us-east-2)  -    $0.0832  pulling  55 sec ago 
   replica=4              aws (us-east-2)  -    $0.0832  running  55 sec ago 

All five replicas are pulling or running, so all five fleet instances are expected to be busy, but three of them are still reported as idle:

dstack fleet
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED     
 default      -         -                -                         -        -       4 days ago  
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 
              1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 
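The mismatch between the two Step4 listings can be checked mechanically. The following is a minimal Python sketch that parses the `dstack ps` and `dstack fleet` output copied from this report and compares the number of active replicas with the number of busy instances. The helper names are hypothetical, and splitting columns on runs of two or more spaces is an assumption about the table layout, not a documented dstack output format.

```python
import re
from collections import Counter

# Listings copied from Step4 of this report.
PS_OUTPUT = """
 NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED
 simple-service-replicas                   -    -        running  56 sec ago
   replica=0              aws (us-east-2)  -    $0.0832  running  55 sec ago
   replica=1              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
   replica=2              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
   replica=3              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
   replica=4              aws (us-east-2)  -    $0.0832  running  55 sec ago
"""

FLEET_OUTPUT = """
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
 default      -         -                -                         -        -       4 days ago
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
               1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
               2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
               3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
               4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
"""


def rows(table: str) -> list[list[str]]:
    """Split every non-empty line into columns separated by 2+ spaces (assumed layout)."""
    return [re.split(r"\s{2,}", line.strip()) for line in table.splitlines() if line.strip()]


def replica_statuses(ps_output: str) -> Counter:
    """Count statuses of the per-replica rows; STATUS is the second-to-last column."""
    return Counter(r[-2] for r in rows(ps_output) if r[0].startswith("replica="))


def instance_statuses(fleet_output: str, fleet_name: str) -> Counter:
    """Count instance statuses for one fleet, following continuation rows without a fleet name."""
    counts: Counter = Counter()
    in_fleet = False
    for r in rows(fleet_output)[1:]:  # skip the header row
        if not r[0].isdigit():  # row starts with a fleet name rather than an instance number
            in_fleet = r[0] == fleet_name
            r = r[1:]
        if in_fleet and r[0].isdigit():
            counts[r[-2]] += 1
    return counts


def unassigned_replicas(ps_output: str, fleet_output: str, fleet_name: str) -> int:
    """Active (running or pulling) replicas minus busy instances; 0 when the listings agree."""
    replicas = replica_statuses(ps_output)
    instances = instance_statuses(fleet_output, fleet_name)
    return replicas["running"] + replicas["pulling"] - instances["busy"]


if __name__ == "__main__":
    print(replica_statuses(PS_OUTPUT))
    print(instance_statuses(FLEET_OUTPUT, "cpu-default"))
    print(unassigned_replicas(PS_OUTPUT, FLEET_OUTPUT, "cpu-default"))
```

For the Step4 listings this gives 5 active replicas against 2 busy instances, so 3 replicas have no busy instance behind them, matching the 3 replicas that later move to terminating.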

Step5: Check the run again after a while:

dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS       SUBMITTED  
 simple-service-replicas                   -    -        terminating  3 mins ago 
   replica=0              aws (us-east-2)  -    $0.0832  running      3 mins ago 
   replica=1              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=2              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=3              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=4              aws (us-east-2)  -    $0.0832  running      3 mins ago

The run gets terminated.

Actual behaviour

On re-run, the run is terminated even though the fleet has idle instances.

dstack fleet
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED     
 default      -         -                -                         -        -       4 days ago  
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 
              1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 

dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS       SUBMITTED  
 simple-service-replicas                   -    -        terminating  3 mins ago 
   replica=0              aws (us-east-2)  -    $0.0832  running      3 mins ago 
   replica=1              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=2              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=3              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=4              aws (us-east-2)  -    $0.0832  running      3 mins ago

Expected behaviour

The re-run should not be terminated, and the idle fleet instances should be reused.
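To make the expectation concrete, here is a toy Python model of the placement this report expects: reuse idle instances first, then provision new ones up to the fleet's `nodes` maximum. This is an illustrative sketch only and not dstack's actual scheduler; `parse_nodes` and `place_replicas` are hypothetical names, and the range parser handles only the `a..b` and single-number forms.

```python
def parse_nodes(spec) -> tuple[int, int]:
    """Parse a fleet `nodes` value such as "0..8" or "5" into (min, max). Simplified."""
    low, sep, high = str(spec).partition("..")
    return (int(low), int(high)) if sep else (int(low), int(low))


def place_replicas(replicas: int, idle: int, busy: int, max_nodes: int) -> dict[str, int]:
    """Toy placement: fill idle instances first, then provision up to max_nodes."""
    reused = min(replicas, idle)
    room = max(max_nodes - (idle + busy), 0)  # instances the fleet may still create
    provisioned = min(replicas - reused, room)
    unplaced = replicas - reused - provisioned
    return {"reused": reused, "provisioned": provisioned, "unplaced": unplaced}


if __name__ == "__main__":
    _, max_nodes = parse_nodes("0..8")  # from my_cpu_fleet.yml
    # State before Step4: 5 idle instances, none busy, 5 replicas requested.
    print(place_replicas(replicas=5, idle=5, busy=0, max_nodes=max_nodes))
```

Under this model the Step4 state (5 idle instances, `nodes: 0..8`, `replicas: 5`) places every replica on an existing idle instance, with nothing left unplaced and no new provisioning needed.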

dstack version

master (commit: b2be6a7)

Server logs

Additional information

server_logs_fleet_issue.txt

Metadata

Labels: bug, major
