1. Overview
The CEPHADM_FAILED_DAEMON error appears when one or more Ceph daemons have failed and cannot start or stay in a stable running state. In the example covered in this article, the node-exporter
daemons on several different nodes are failing.

Common causes:
- The daemon service crashed or could not start.
- A required library or file is missing.
- Permission or file-ownership problems.
- Errors during a Ceph upgrade or update.
- Network or host configuration issues.
- Insufficient system resources such as CPU, RAM, or disk space.
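For a quick first pass at these common causes, a minimal triage sketch like the one below can help. It assumes you are root on the affected host and that the systemd unit follows the naming pattern shown later in this article; the fsid is just this lab's example value.
# Example fsid from this lab; replace with your own (see: ceph fsid)
FSID="75ac298c-0653-11f0-a2e7-2b96c52a296a"
UNIT="ceph-${FSID}@node-exporter.$(hostname).service"
systemctl is-active "$UNIT" || journalctl -xeu "$UNIT" --no-pager | tail -n 50   # crash or start failure
df -h /var/lib/ceph    # disk space for cephadm data
free -h                # available memory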
2. Checking the failed daemons
Step 1: Identify the failed daemons
First, run the following command to list the failed daemons:
ceph health detail
In the example below, the node-exporter daemons on nearly every node have failed.
HEALTH_WARN 5 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 5 failed cephadm daemon(s)
daemon node-exporter.CEPH-LAB-MON-071 on CEPH-LAB-MON-071 is in error state
daemon node-exporter.CEPH-LAB-MON-072 on CEPH-LAB-MON-072 is in error state
daemon node-exporter.CEPH-LAB-MON-073 on CEPH-LAB-MON-073 is in error state
daemon node-exporter.CEPH-LAB-OSD-074 on CEPH-LAB-OSD-074 is in error state
daemon node-exporter.CEPH-LAB-OSD-075 on CEPH-LAB-OSD-075 is in error state
daemon node-exporter.CEPH-LAB-OSD-076 on CEPH-LAB-OSD-076 is in error state
Step 2: Check the detailed status of the daemons
Run the following command to check the detailed status of daemons of a specific type:
shell> ceph orch ps --daemon-type node-exporter
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
node-exporter.CEPH-LAB-MON-071 CEPH-LAB-MON-071 *:9100 error 5m ago 49m - - <unknown> <unknown> <unknown>
node-exporter.CEPH-LAB-MON-072 CEPH-LAB-MON-072 *:9100 error 5m ago 18m - - <unknown> <unknown> <unknown>
node-exporter.CEPH-LAB-MON-073 CEPH-LAB-MON-073 *:9100 error 5m ago 18m - - <unknown> <unknown> <unknown>
node-exporter.CEPH-LAB-OSD-074 CEPH-LAB-OSD-074 *:9100 error 5m ago 14m - - <unknown> <unknown> <unknown>
node-exporter.CEPH-LAB-OSD-075 CEPH-LAB-OSD-075 *:9100 error 4m ago 14m - - <unknown> <unknown> <unknown>
node-exporter.CEPH-LAB-OSD-076 CEPH-LAB-OSD-076 *:9100 error 4m ago 14m - - <unknown> <unknown> <unknown>
Or, to see all of the failed daemons:
shell> ceph orch ps --format json-pretty | grep -B5 'error'
"ports": [
9100
],
"service_name": "node-exporter",
"status": -1,
"status_desc": "error"
--
"ports": [
9100
],
"service_name": "node-exporter",
"status": -1,
"status_desc": "error"
--
"ports": [
9100
],
"service_name": "node-exporter",
"status": -1,
"status_desc": "error"
--
"ports": [
9100
],
"service_name": "node-exporter",
"status": -1,
"status_desc": "error"
--
"ports": [
9100
],
"service_name": "node-exporter",
"status": -1,
"status_desc": "error"
3. Fixing the failed daemons
Option 1: Restart the daemon
You can check the daemon's systemd status first:
shell> systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service
○ ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service - Ceph node-exporter.CEPH-LAB-MON-071 for 75ac298c-0653-11f0-a2e7-2b96c52a296a
Loaded: loaded (/etc/systemd/system/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@.service; disabled; vendor preset: enabled)
Active: inactive (dead)
If the daemon only failed temporarily, you can try restarting it (replace <ceph_fsid> with your cluster's fsid, which you can get from ceph fsid):
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-071.service
Check the daemon's status again after it has started:
shell> systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service
● ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service - Ceph node-exporter.CEPH-LAB-MON-071 for 75ac298c-0653-11f0-a2e7-2b96c52a296a
Loaded: loaded (/etc/systemd/system/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2025-03-21 20:47:02 +07; 1min 2s ago
Process: 36853 ExecStartPre=/bin/rm -f /run/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service-pid /run/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service-cid (code=exited, >
Process: 36854 ExecStart=/bin/bash /var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/node-exporter.CEPH-LAB-MON-071/unit.run (code=exited, status=0/SUCCESS)
Main PID: 37042 (conmon)
Tasks: 8 (limit: 9830)
Memory: 7.6M
CPU: 383ms
CGroup: /system.slice/system-ceph\x2d75ac298c\x2d0653\x2d11f0\x2da2e7\x2d2b96c52a296a.slice/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service
├─libpod-payload-9d85503a97ca590359a5f3b742df1f19b20f60dcbc0e8f35ca822d881b519484
│ ├─37045 /dev/init -- /bin/node_exporter --no-collector.timex --web.listen-address=:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs
│ └─37047 /bin/node_exporter --no-collector.timex --web.listen-address=:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs
└─supervisor
└─37042 /usr/bin/conmon --api-version 1 -c 9d85503a97ca590359a5f3b742df1f19b20f60dcbc0e8f35ca822d881b519484 -u 9d85503a97ca590359a5f3b742df1f19b20f60dcbc0e8f35ca822d881b519484 -r /usr/bin/crun -b /var/lib/containers/storag>
Or restart all of the failed daemons:
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-071.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-072.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-073.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-074.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-075.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-076.service
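Since each unit has to be restarted on its own host, a small loop run from a node with SSH access to all hosts can save typing. This is only a sketch: the host list is taken from this lab, and the fsid is fetched with ceph fsid.
FSID=$(ceph fsid)
for HOST in CEPH-LAB-MON-071 CEPH-LAB-MON-072 CEPH-LAB-MON-073 \
            CEPH-LAB-OSD-074 CEPH-LAB-OSD-075 CEPH-LAB-OSD-076; do
  # Restart the node-exporter unit on each host over SSH
  ssh "$HOST" "systemctl restart ceph-${FSID}@node-exporter.${HOST}.service"
done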
Option 2: Remove the daemons and redeploy
If restarting does not fix the error, you can try removing the service and deploying it again:
shell> ceph orch rm node-exporter --force
Removed service node-exporter
shell> ceph orch apply node-exporter
Scheduled node-exporter update...
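Depending on your Ceph version, a less disruptive alternative is to restart or redeploy through the orchestrator instead of removing the whole service; treat the exact subcommands as version-dependent.
# Restart every daemon of the service via the orchestrator
ceph orch restart node-exporter
# Or redeploy a single failed daemon
ceph orch daemon redeploy node-exporter.CEPH-LAB-MON-071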
Option 3: Check the logs to find the root cause
After digging through the logs for a while, I found that the failure was caused by a wrong image URL; the ceph health detail output further below shows the placement failures that pointed there.
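A few starting points for collecting those logs yourself, assuming the unit naming shown earlier and that the cephadm binary is available on the host (the daemon name is just this lab's example):
# systemd journal for the daemon unit (on the affected host)
journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service
# Container logs for the daemon, via cephadm
cephadm logs --name node-exporter.CEPH-LAB-OSD-076
# Recent cephadm events from the cluster log (from any admin node)
ceph log last cephadm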
shell> ceph health detail
HEALTH_WARN Failed to place 2 daemon(s); 3 failed cephadm daemon(s)
[WRN] CEPHADM_DAEMON_PLACE_FAIL: Failed to place 2 daemon(s)
Failed while placing node-exporter.CEPH-LAB-OSD-076 on CEPH-LAB-OSD-076: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-076
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-076
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-076
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-076
Deploy daemon node-exporter.CEPH-LAB-OSD-076 ...
Verifying port 0.0.0.0:9100 ...
Non-zero exit code 1 from systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076
systemctl: stderr Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" for details.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10889, in <module>
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10877, in main
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6786, in command_deploy_from
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6804, in _common_deploy
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6856, in _dispatch_deploy
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 3964, in deploy_daemon
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 4207, in deploy_daemon_units
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 2266, in call_throws
RuntimeError: Failed command: systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076: Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service failed because the control process exited with error code.
See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" for details.
Failed while placing node-exporter.CEPH-LAB-OSD-075 on CEPH-LAB-OSD-075: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-075
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-075
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-075
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-075
Deploy daemon node-exporter.CEPH-LAB-OSD-075 ...
Verifying port 0.0.0.0:9100 ...
Non-zero exit code 1 from systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075
systemctl: stderr Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" for details.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10889, in <module>
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10877, in main
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6786, in command_deploy_from
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6804, in _common_deploy
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6856, in _dispatch_deploy
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 3964, in deploy_daemon
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 4207, in deploy_daemon_units
File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 2266, in call_throws
RuntimeError: Failed command: systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075: Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service failed because the control process exited with error code.
See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" for details.
[WRN] CEPHADM_FAILED_DAEMON: 3 failed cephadm daemon(s)
daemon node-exporter.CEPH-LAB-MON-072 on CEPH-LAB-MON-072 is in unknown state
daemon node-exporter.CEPH-LAB-MON-073 on CEPH-LAB-MON-073 is in unknown state
daemon node-exporter.CEPH-LAB-OSD-074 on CEPH-LAB-OSD-074 is in unknown state
The fix was to update the image URLs as shown below.
# Set ceph_repository_host and ceph_repository_port
ceph_repository_host="10.237.7.74"
ceph_repository_port="5000"
# Configure the cluster to use images from the local registry
ceph config set mgr mgr/cephadm/container_image_prometheus ${ceph_repository_host}:${ceph_repository_port}/prometheus/prometheus:v2.43.0
ceph config set mgr mgr/cephadm/container_image_grafana ${ceph_repository_host}:${ceph_repository_port}/ceph/ceph-grafana:8.3.5
ceph config set mgr mgr/cephadm/container_image_alertmanager ${ceph_repository_host}:${ceph_repository_port}/prometheus/alertmanager:v0.25.0
ceph config set mgr mgr/cephadm/container_image_node_exporter ${ceph_repository_host}:${ceph_repository_port}/prometheus/node-exporter:v1.5.0
Verify the image URLs after the update:
shell> ceph config get mgr
WHO MASK LEVEL OPTION VALUE RO
global advanced cluster_network 10.237.7.0/24 *
global basic container_image 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc *
mgr advanced mgr/cephadm/container_image_alertmanager 10.237.7.74:5000/prometheus/alertmanager:v0.25.0 *
mgr advanced mgr/cephadm/container_image_grafana 10.237.7.74:5000/ceph/ceph-grafana:8.3.5 *
mgr advanced mgr/cephadm/container_image_node_exporter 10.237.7.74:5000/prometheus/node-exporter:v1.5.0 *
mgr advanced mgr/cephadm/container_image_prometheus 10.237.7.74:5000/prometheus/prometheus:v2.43.0 *
mgr advanced mgr/cephadm/container_init True *
mgr advanced mgr/cephadm/migration_current 6 *
mgr advanced mgr/dashboard/ALERTMANAGER_API_HOST http://CEPH-LAB-MON-071:9093 *
mgr advanced mgr/dashboard/GRAFANA_API_SSL_VERIFY false *
mgr advanced mgr/dashboard/GRAFANA_API_URL https://CEPH-LAB-MON-071:3000 *
mgr advanced mgr/dashboard/PROMETHEUS_API_HOST http://CEPH-LAB-MON-071:9095 *
mgr advanced mgr/dashboard/ssl_server_port 8443 *
Or, more concisely:
shell> ceph config dump | grep mgr/cephadm
mgr advanced mgr/cephadm/container_image_alertmanager 10.237.7.74:5000/prometheus/alertmanager:v0.25.0 *
mgr advanced mgr/cephadm/container_image_grafana 10.237.7.74:5000/ceph/ceph-grafana:8.3.5 *
mgr advanced mgr/cephadm/container_image_node_exporter 10.237.7.74:5000/prometheus/node-exporter:v1.5.0 *
mgr advanced mgr/cephadm/container_image_prometheus 10.237.7.74:5000/prometheus/prometheus:v2.43.0 *
mgr advanced mgr/cephadm/container_init True *
mgr advanced mgr/cephadm/migration_current 6 *
Then, if you want the daemons to pick up the change immediately, repeat the steps above, for example:
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-071.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-072.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-073.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-074.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-075.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-076.service
Or:
shell> ceph orch rm node-exporter --force
Removed service node-exporter
shell> ceph orch apply node-exporter
Scheduled node-exporter update...
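To confirm that the daemons came back up after the redeploy, re-run the checks from section 2:
ceph health detail
ceph orch ps --daemon-type node-exporter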
Option 4: Check system resources
If the server is short on resources, a daemon may fail to start. Check available memory with:
free -h
Check disk space:
df -h
Check CPU usage:
top
If the resources are overloaded, consider optimizing the workload or upgrading the hardware.
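cephadm can also run its own host checks from the cluster side; a sketch, assuming a Ceph release where the check-host command is available:
# Built-in host checks (container engine, time sync, hostname, ...) for one host
ceph cephadm check-host CEPH-LAB-OSD-076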
4. Preventing the error in the future
Configure daemon monitoring
Use Prometheus + Grafana to monitor daemon status:
ceph mgr module enable prometheus
Enable automatic daemon recovery
You can try enabling daemon auto-recovery (check that this option exists in your Ceph release) with:
ceph config set mgr mgr/cephadm_auto_repair true
Update Ceph regularly
If the error appears because of an outdated version, upgrade Ceph:
ceph orch upgrade start --ceph-version latest
5. Conclusion
CEPHADM_FAILED_DAEMON is one of the most common errors in Ceph when using cephadm to manage services. Identifying the root cause and fixing it promptly will keep the cluster running more reliably. I hope this article helps you understand this error and how to resolve it effectively!