go: downloading k8s.io/apimachinery v0.34.1
go: downloading k8s.io/api v0.34.1
go: downloading sigs.k8s.io/kueue v0.6.2
go: downloading sigs.k8s.io/jobset v0.10.1
go: downloading github.com/matoous/go-nanoid/v2 v2.1.0
go: downloading github.com/prometheus/common v0.67.2
go: downloading github.com/prometheus/client_golang v1.23.2
go: downloading github.com/onsi/gomega v1.38.2
go: downloading github.com/kubeflow/trainer/v2 v2.1.0
go: downloading github.com/minio/minio-go/v7 v7.0.98
go: downloading github.com/kubeflow/training-operator v1.7.0
go: downloading github.com/openshift/api v0.0.0-20251124165233-999c45c0835a
go: downloading github.com/openshift/client-go v0.0.0-20251015124057-db0dee36e235
go: downloading github.com/openshift/kueue-operator v0.0.0-20251202204851-958c48004dad
go: downloading github.com/operator-framework/api v0.36.0
go: downloading github.com/operator-framework/operator-lifecycle-manager v0.38.0
go: downloading github.com/ray-project/kuberay/ray-operator v1.3.0
go: downloading k8s.io/client-go v0.34.1
go: downloading github.com/prometheus/client_model v0.6.2
go: downloading go.yaml.in/yaml/v2 v2.4.3
go: downloading google.golang.org/protobuf v1.36.10
go: downloading github.com/json-iterator/go v1.1.12
go: downloading github.com/go-ini/ini v1.67.0
go: downloading github.com/dustin/go-humanize v1.0.1
go: downloading github.com/google/uuid v1.6.0
go: downloading github.com/klauspost/compress v1.18.2
go: downloading github.com/klauspost/crc32 v1.3.0
go: downloading github.com/minio/crc64nvme v1.1.1
go: downloading github.com/minio/md5-simd v1.1.2
go: downloading go.yaml.in/yaml/v3 v3.0.4
go: downloading golang.org/x/net v0.48.0
go: downloading github.com/google/go-cmp v0.7.0
go: downloading github.com/gogo/protobuf v1.3.2
go: downloading k8s.io/klog/v2 v2.130.1
go: downloading sigs.k8s.io/randfill v1.0.0
go: downloading github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f
go: downloading golang.org/x/oauth2 v0.32.0
go: downloading k8s.io/kube-openapi v0.0.0-20250710124328-f3f2b991d03b
go: downloading sigs.k8s.io/controller-runtime v0.22.4
go: downloading gopkg.in/inf.v0 v0.9.1
go: downloading k8s.io/utils v0.0.0-20251002143259-bc988d571ff4
go: downloading sigs.k8s.io/yaml v1.6.0
go: downloading sigs.k8s.io/structured-merge-diff/v6 v6.3.0
go: downloading github.com/sirupsen/logrus v1.9.3
go: downloading volcano.sh/apis v1.13.1-0.20251028070205-46d20c0699e7
go: downloading github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd
go: downloading github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee
go: downloading github.com/klauspost/cpuid/v2 v2.2.11
go: downloading golang.org/x/crypto v0.46.0
go: downloading github.com/rs/xid v1.6.0
go: downloading github.com/tinylib/msgp v1.6.1
go: downloading golang.org/x/sys v0.39.0
go: downloading github.com/jpillora/backoff v1.0.0
go: downloading github.com/go-logr/logr v1.4.3
go: downloading sigs.k8s.io/json v0.0.0-20241014173422-cfa47c3a1cc8
go: downloading github.com/blang/semver/v4 v4.0.0
go: downloading github.com/fxamacker/cbor/v2 v2.9.0
go: downloading github.com/beorn7/perks v1.0.1
go: downloading github.com/cespare/xxhash/v2 v2.3.0
go: downloading github.com/prometheus/procfs v0.16.1
go: downloading github.com/emicklei/go-restful/v3 v3.12.2
go: downloading github.com/go-openapi/jsonreference v0.21.0
go: downloading github.com/go-openapi/swag v0.23.1
go: downloading github.com/google/gnostic-models v0.7.0
go: downloading github.com/philhofer/fwd v1.2.0
go: downloading golang.org/x/text v0.32.0
go: downloading github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822
go: downloading github.com/go-openapi/jsonpointer v0.21.1
go: downloading github.com/x448/float16 v0.8.4
go: downloading github.com/mailru/easyjson v0.9.0
go: downloading gopkg.in/yaml.v3 v3.0.1
go: downloading github.com/spf13/pflag v1.0.10
go: downloading golang.org/x/term v0.38.0
go: downloading golang.org/x/time v0.14.0
go: downloading gopkg.in/evanphx/json-patch.v4 v4.12.0
go: downloading github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc
go: downloading github.com/josharian/intern v1.0.0
go: downloading github.com/pkg/errors v0.9.1
Warning: Failed to get DSC: the server could not find the requested resource
Initial Kueue managementState:
=== RUN   TestDefaultTrainingHubRuntimesMatchDefaultClusterRuntimes
    cluster_training_runtimes_test.go:142: CTR "torch-distributed-cpu-torch291-py312" matches DefaultClusterTrainingRuntime "torch-distributed-cpu"
    cluster_training_runtimes_test.go:142: CTR "training-hub-rocm" matches DefaultClusterTrainingRuntime "torch-distributed-rocm"
    cluster_training_runtimes_test.go:142: CTR "torch-distributed-cuda130-torch291-py312" matches DefaultClusterTrainingRuntime "torch-distributed"
    cluster_training_runtimes_test.go:142: CTR "torch-distributed-rocm64-torch291-py312" matches DefaultClusterTrainingRuntime "torch-distributed-rocm"
    cluster_training_runtimes_test.go:142: CTR "training-hub" matches DefaultClusterTrainingRuntime "torch-distributed"
    cluster_training_runtimes_test.go:142: CTR "training-hub-cpu" matches DefaultClusterTrainingRuntime "torch-distributed-cpu"
    cluster_training_runtimes_test.go:142: CTR "training-hub-th06-cuda130-torch291-py312" matches DefaultClusterTrainingRuntime "torch-distributed"
    cluster_training_runtimes_test.go:142: CTR "training-hub-th06-cpu-torch291-py312" matches DefaultClusterTrainingRuntime "torch-distributed-cpu"
    cluster_training_runtimes_test.go:142: CTR "training-hub-th06-rocm64-torch291-py312" matches DefaultClusterTrainingRuntime "torch-distributed-rocm"
    cluster_training_runtimes_test.go:145: All CTRs match their DefaultClusterTrainingRuntime counterparts!
--- PASS: TestDefaultTrainingHubRuntimesMatchDefaultClusterRuntimes (0.01s)
=== RUN   TestRunTrainJobWithDefaultClusterTrainingRuntimes
    cluster_training_runtimes_test.go:161: Running TrainJob with ClusterTrainingRuntime: torch-distributed
    cluster_training_runtimes_test.go:167: Created TrainJob test-ns-gskjb/test-trainjob-pb95z successfully
    cluster_training_runtimes_test.go:178: TrainJob with ClusterTrainingRuntime 'torch-distributed' completed successfully
    dscInitialization.go:44: Using applications namespace from env var APPLICATIONS_NAMESPACE: kubeflow-trainer-system
    cluster_training_runtimes_test.go:161: Running TrainJob with ClusterTrainingRuntime: torch-distributed-rocm
    cluster_training_runtimes_test.go:167: Created TrainJob test-ns-p4l27/test-trainjob-bn2mq successfully
    cluster_training_runtimes_test.go:178: TrainJob with ClusterTrainingRuntime 'torch-distributed-rocm' completed successfully
    dscInitialization.go:44: Using applications namespace from env var APPLICATIONS_NAMESPACE: kubeflow-trainer-system
    cluster_training_runtimes_test.go:161: Running TrainJob with ClusterTrainingRuntime: torch-distributed-cpu
    cluster_training_runtimes_test.go:167: Created TrainJob test-ns-4ltzp/test-trainjob-jqpl7 successfully
    cluster_training_runtimes_test.go:178: TrainJob with ClusterTrainingRuntime 'torch-distributed-cpu' completed successfully
    dscInitialization.go:44: Using applications namespace from env var APPLICATIONS_NAMESPACE: kubeflow-trainer-system
    cluster_training_runtimes_test.go:161: Running TrainJob with ClusterTrainingRuntime: torch-distributed-cuda128-torch29-py312
    cluster_training_runtimes_test.go:167: Created TrainJob test-ns-cwv8s/test-trainjob-nwlks successfully
    cluster_training_runtimes_test.go:178: TrainJob with ClusterTrainingRuntime 'torch-distributed-cuda128-torch29-py312' completed successfully
    dscInitialization.go:44: Using applications namespace from env var APPLICATIONS_NAMESPACE: kubeflow-trainer-system
    cluster_training_runtimes_test.go:161: Running TrainJob with ClusterTrainingRuntime: torch-distributed-rocm64-torch29-py312
    cluster_training_runtimes_test.go:167: Created TrainJob test-ns-twqv9/test-trainjob-zkhcw successfully
    cluster_training_runtimes_test.go:171: Timed out after 1200.001s.
        Expected
            <*v1alpha1.TrainJob | 0xc000760340>: {
                TypeMeta: {Kind: "", APIVersion: ""},
                ObjectMeta: {
                    Name: "test-trainjob-zkhcw",
                    GenerateName: "test-trainjob-",
                    Namespace: "test-ns-twqv9",
                    SelfLink: "",
                    UID: "8f3e586b-298f-4aaf-939f-340961b7dc5c",
                    ResourceVersion: "19975",
                    Generation: 1,
                    CreationTimestamp: {
                        Time: 2026-04-22T15:34:53Z,
                    },
                    DeletionTimestamp: nil,
                    DeletionGracePeriodSeconds: nil,
                    Labels: nil,
                    Annotations: nil,
                    OwnerReferences: nil,
                    Finalizers: nil,
                    ManagedFields: [
                        {
                            Manager: "manager",
                            Operation: "Update",
                            APIVersion: "trainer.kubeflow.org/v1alpha1",
                            Time: {
                                Time: 2026-04-22T15:34:53Z,
                            },
                            FieldsType: "FieldsV1",
                            FieldsV1: {
                                Raw: "{\"f:status\":{\".\":{},\"f:jobsStatus\":{\".\":{},\"k:{\\\"name\\\":\\\"node\\\"}\":{\".\":{},\"f:active\":{},\"f:failed\":{},\"f:name\":{},\"f:ready\":{},\"f:succeeded\":{},\"f:suspended\":{}}}}}",
                            },
                            Subresource: "status",
                        },
                        {
                            Manager: "trainer.test",
                            Operation: "Update",
                            APIVersion: "trainer.kubeflow.org/v1alpha1",
                            Time: {
                                Time: 2026-04-22T15:34:53Z,
                            },
                            FieldsType: "FieldsV1",
                            FieldsV1: {
                                Raw: "{\"f:metadata\":{\"f:generateName\":{}},\"f:spec\":{\".\":{},\"f:managedBy\":{},\"f:runtimeRef\":{\".\":{},\"f:apiGroup\":{},\"f:kind\":{},\"f:name\":{}},\"f:suspend\":{},\"f:trainer\":{\".\":{},\"f:command\":{}}}}",
                            },
                            Subresource: "",
                        },
                    ],
                },
                Spec: {
                    RuntimeRef: {
                        Name: "torch-distributed-rocm64-torch29-py312",
                        APIGroup: "trainer.kubeflow.org",
                        Kind: "ClusterTrainingRuntime",
                    },
                    Initializer: nil,
                    Trainer: {
                        Image: nil,
                        Command: [
                            "python",
                            "-c",
                            "import torch; print(f'PyTorch version: {torch.__version__}'); print('Training completed successfully')",
                        ],
                        Args: nil,
                        Env: nil,
                        NumNodes: nil,
                        ResourcesPerNode: nil,
                        NumProcPerNode: nil,
                    },
                    Labels: nil,
                    Annotations: nil,
                    PodTemplateOverrides: nil,
                    Suspend: false,
                    ManagedBy: "trainer.kubeflow.org/trainjob-controller",
                },
                Status: {
                    Conditions: nil,
                    JobsStatus: [
                        {Name: "node", Ready: 0, Succeeded: 0, Failed: 0, Active: 1, Suspended: 0},
                    ],
                },
            }
        to satisfy predicate : 0x1c71fe0
    test.go:169: Retrieving Pod Container test-ns-twqv9/test-trainjob-zkhcw-node-0-0-hcc9l/node logs
    test.go:169: Failed to retrieve logs for Pod Container test-ns-twqv9/test-trainjob-zkhcw-node-0-0-hcc9l/node, logs cannot be stored
    test.go:152: Creating ephemeral output directory as TEST_OUTPUT_DIR env variable is unset
    test.go:160: Output directory has been created at: /tmp/TestRunTrainJobWithDefaultClusterTrainingRuntimes772723123
    test.go:169: Retrieving Pod Container test-ns-cwv8s/test-trainjob-nwlks-node-0-0-s86pr/node logs
    test.go:169: Retrieving Pod Container test-ns-4ltzp/test-trainjob-jqpl7-node-0-0-nv7mg/node logs
    test.go:169: Retrieving Pod Container test-ns-p4l27/test-trainjob-bn2mq-node-0-0-gck6z/node logs
    test.go:169: Retrieving Pod Container test-ns-gskjb/test-trainjob-pb95z-node-0-0-m6hwq/node logs
--- FAIL: TestRunTrainJobWithDefaultClusterTrainingRuntimes (2424.82s)
FAIL
TearDown: Setting kueue managementState to Removed in DataScienceCluster...
TearDown: Failed to set Kueue to Removed: TearDown: failed to set kueue to Removed: the server could not find the requested resource
ok github.com/opendatahub-io/distributed-workloads/tests/trainer 2424.898s