Monday, March 24, 2025

The CEPHADM_FAILED_DAEMON error


1. Overview

The CEPHADM_FAILED_DAEMON error appears when one or more Ceph daemons fail and cannot start or stay in a stable running state. In the example used throughout this article, the node-exporter daemons on several different nodes are failing.

Common causes:

  • The daemon crashed or cannot start.
  • Missing libraries or required files.
  • Permission or file-ownership problems.
  • Errors during a Ceph upgrade or update.
  • Network problems or host misconfiguration.
  • Insufficient system resources such as CPU, RAM, or disk space.

2. Checking the status of failed daemons

Step 1: Identify the failed daemons

First, run the following command to see the list of failed daemons:

ceph health detail

In the example below, nearly all of the node-exporter daemons across the nodes have failed.

HEALTH_WARN 5 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 5 failed cephadm daemon(s)
    daemon node-exporter.CEPH-LAB-MON-071 on CEPH-LAB-MON-071 is in error state
    daemon node-exporter.CEPH-LAB-MON-072 on CEPH-LAB-MON-072 is in error state
    daemon node-exporter.CEPH-LAB-MON-073 on CEPH-LAB-MON-073 is in error state
    daemon node-exporter.CEPH-LAB-OSD-074 on CEPH-LAB-OSD-074 is in error state
    daemon node-exporter.CEPH-LAB-OSD-075 on CEPH-LAB-OSD-075 is in error state
    daemon node-exporter.CEPH-LAB-OSD-076 on CEPH-LAB-OSD-076 is in error state
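If you only need the names of the failing daemons (for scripting or reporting), a convenience one-liner can filter them out of the same output, based on the message format shown above:

ceph health detail | awk '/is in error state/ {print $2}'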

Step 2: Check the detailed status of the daemons

Run the following command to check the detailed status of a specific daemon type:

shell> ceph orch ps --daemon-type node-exporter
NAME                            HOST              PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID
node-exporter.CEPH-LAB-MON-071  CEPH-LAB-MON-071  *:9100  error             5m ago  49m        -        -  <unknown>  <unknown>     <unknown>
node-exporter.CEPH-LAB-MON-072  CEPH-LAB-MON-072  *:9100  error             5m ago  18m        -        -  <unknown>  <unknown>     <unknown>
node-exporter.CEPH-LAB-MON-073  CEPH-LAB-MON-073  *:9100  error             5m ago  18m        -        -  <unknown>  <unknown>     <unknown>
node-exporter.CEPH-LAB-OSD-074  CEPH-LAB-OSD-074  *:9100  error             5m ago  14m        -        -  <unknown>  <unknown>     <unknown>
node-exporter.CEPH-LAB-OSD-075  CEPH-LAB-OSD-075  *:9100  error             4m ago  14m        -        -  <unknown>  <unknown>     <unknown>
node-exporter.CEPH-LAB-OSD-076  CEPH-LAB-OSD-076  *:9100  error             4m ago  14m        -        -  <unknown>  <unknown>     <unknown>
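To see why a specific daemon is in error state, the systemd journal on the host running it is usually the most informative place. For example, on the affected host (the unit name follows the ceph-<ceph_fsid>@<daemon_name>.service pattern used later in this article), or via the cephadm wrapper:

# On the affected host; substitute your cluster fsid
journalctl -u ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-071.service -n 50 --no-pager

# Or let cephadm wrap journalctl for you
cephadm logs --name node-exporter.CEPH-LAB-MON-071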

Or, to see every failed daemon:

shell> ceph orch ps --format json-pretty | grep -B5 'error'
    "ports": [
      9100
    ],
    "service_name": "node-exporter",
    "status": -1,
    "status_desc": "error"
--
    "ports": [
      9100
    ],
    "service_name": "node-exporter",
    "status": -1,
    "status_desc": "error"
--
    "ports": [
      9100
    ],
    "service_name": "node-exporter",
    "status": -1,
    "status_desc": "error"
--
    "ports": [
      9100
    ],
    "service_name": "node-exporter",
    "status": -1,
    "status_desc": "error"
--
    "ports": [
      9100
    ],
    "service_name": "node-exporter",
    "status": -1,
    "status_desc": "error"

3. Fixing failed daemons

Option 1: Restart the daemon

You can check the daemon's current status first:

shell> systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service
○ ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service - Ceph node-exporter.CEPH-LAB-MON-071 for 75ac298c-0653-11f0-a2e7-2b96c52a296a
     Loaded: loaded (/etc/systemd/system/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@.service; disabled; vendor preset: enabled)
     Active: inactive (dead)

If the daemon only failed temporarily, you can try restarting it:

systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-071.service
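The <ceph_fsid> placeholder is the cluster fsid. If the ceph CLI and keyring are available on that host, you can look it up and substitute it directly; otherwise copy the value from a MON node:

shell> ceph fsid
75ac298c-0653-11f0-a2e7-2b96c52a296a

systemctl restart ceph-$(ceph fsid)@node-exporter.CEPH-LAB-MON-071.service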

Check the daemon's status again once it has started:

shell> systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service
● ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service - Ceph node-exporter.CEPH-LAB-MON-071 for 75ac298c-0653-11f0-a2e7-2b96c52a296a
     Loaded: loaded (/etc/systemd/system/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2025-03-21 20:47:02 +07; 1min 2s ago
    Process: 36853 ExecStartPre=/bin/rm -f /run/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service-pid /run/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service-cid (code=exited, >
    Process: 36854 ExecStart=/bin/bash /var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/node-exporter.CEPH-LAB-MON-071/unit.run (code=exited, status=0/SUCCESS)
   Main PID: 37042 (conmon)
      Tasks: 8 (limit: 9830)
     Memory: 7.6M
        CPU: 383ms
     CGroup: /system.slice/system-ceph\x2d75ac298c\x2d0653\x2d11f0\x2da2e7\x2d2b96c52a296a.slice/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-MON-071.service
             ├─libpod-payload-9d85503a97ca590359a5f3b742df1f19b20f60dcbc0e8f35ca822d881b519484
             │ ├─37045 /dev/init -- /bin/node_exporter --no-collector.timex --web.listen-address=:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs
             │ └─37047 /bin/node_exporter --no-collector.timex --web.listen-address=:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs
             └─supervisor
               └─37042 /usr/bin/conmon --api-version 1 -c 9d85503a97ca590359a5f3b742df1f19b20f60dcbc0e8f35ca822d881b519484 -u 9d85503a97ca590359a5f3b742df1f19b20f60dcbc0e8f35ca822d881b519484 -r /usr/bin/crun -b /var/lib/containers/storag>

Or restart all of the failed daemons:

systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-071.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-072.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-073.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-074.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-075.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-076.service
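Since each node-exporter unit lives on its own host, in practice you would run the restarts over SSH. A minimal sketch, assuming passwordless SSH from the admin node and the fsid from above:

fsid="75ac298c-0653-11f0-a2e7-2b96c52a296a"
for host in CEPH-LAB-MON-071 CEPH-LAB-MON-072 CEPH-LAB-MON-073 \
            CEPH-LAB-OSD-074 CEPH-LAB-OSD-075 CEPH-LAB-OSD-076; do
  ssh "${host}" "systemctl restart ceph-${fsid}@node-exporter.${host}.service"
done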

Option 2: Remove the daemon and redeploy it

If restarting does not fix the problem, you can try removing the daemon and deploying it again:

shell> ceph orch rm node-exporter --force
Removed service node-exporter

shell> ceph orch apply node-exporter
Scheduled node-exporter update...
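If you would rather not remove the whole service, the orchestrator can also restart or redeploy a single daemon by name, for example:

ceph orch daemon restart node-exporter.CEPH-LAB-MON-071
ceph orch daemon redeploy node-exporter.CEPH-LAB-MON-071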

Option 3: Check the logs to find the root cause

After digging through the logs, I found that the failures were caused by an incorrect container image URL.

shell> ceph health detail
HEALTH_WARN Failed to place 2 daemon(s); 3 failed cephadm daemon(s)
[WRN] CEPHADM_DAEMON_PLACE_FAIL: Failed to place 2 daemon(s)
    Failed while placing node-exporter.CEPH-LAB-OSD-076 on CEPH-LAB-OSD-076: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-076
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-076
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-076
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-076
Deploy daemon node-exporter.CEPH-LAB-OSD-076 ...
Verifying port 0.0.0.0:9100 ...
Non-zero exit code 1 from systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076
systemctl: stderr Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" for details.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10889, in <module>
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10877, in main
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6786, in command_deploy_from
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6804, in _common_deploy
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6856, in _dispatch_deploy
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 3964, in deploy_daemon
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 4207, in deploy_daemon_units
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 2266, in call_throws
RuntimeError: Failed command: systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076: Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service failed because the control process exited with error code.
See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service" for details.
    Failed while placing node-exporter.CEPH-LAB-OSD-075 on CEPH-LAB-OSD-075: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-075
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-075
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-075
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter.CEPH-LAB-OSD-075
Deploy daemon node-exporter.CEPH-LAB-OSD-075 ...
Verifying port 0.0.0.0:9100 ...
Non-zero exit code 1 from systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075
systemctl: stderr Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" for details.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10889, in <module>
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 10877, in main
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6786, in command_deploy_from
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6804, in _common_deploy
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 6856, in _dispatch_deploy
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 3964, in deploy_daemon
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 4207, in deploy_daemon_units
  File "/var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/cephadm.91b52e446d8f1d91339889933063a5070027dc00f54d563f523727c6dd22b172/__main__.py", line 2266, in call_throws
RuntimeError: Failed command: systemctl start ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075: Job for ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service failed because the control process exited with error code.
See "systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" and "journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-075.service" for details.
[WRN] CEPHADM_FAILED_DAEMON: 3 failed cephadm daemon(s)
    daemon node-exporter.CEPH-LAB-MON-072 on CEPH-LAB-MON-072 is in unknown state
    daemon node-exporter.CEPH-LAB-MON-073 on CEPH-LAB-MON-073 is in unknown state
    daemon node-exporter.CEPH-LAB-OSD-074 on CEPH-LAB-OSD-074 is in unknown state
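The output above already tells you where to dig next. Following its own hint, the per-host journal shows the underlying deploy failure, and the cluster log's cephadm channel is also worth a look:

# On the affected host
journalctl -xeu ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@node-exporter.CEPH-LAB-OSD-076.service

# Recent cephadm events from the cluster log
ceph log last 50 info cephadm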

The fix was to update the image URLs as shown below.

# Set the local registry host and port
ceph_repository_host="10.237.7.74"
ceph_repository_port="5000"

# Point cephadm at the container images hosted in the local registry
ceph config set mgr mgr/cephadm/container_image_prometheus ${ceph_repository_host}:${ceph_repository_port}/prometheus/prometheus:v2.43.0
ceph config set mgr mgr/cephadm/container_image_grafana ${ceph_repository_host}:${ceph_repository_port}/ceph/ceph-grafana:8.3.5
ceph config set mgr mgr/cephadm/container_image_alertmanager ${ceph_repository_host}:${ceph_repository_port}/prometheus/alertmanager:v0.25.0
ceph config set mgr mgr/cephadm/container_image_node_exporter ${ceph_repository_host}:${ceph_repository_port}/prometheus/node-exporter:v1.5.0
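Before relying on these settings, it is worth confirming that the local registry actually serves the images. A quick check, assuming the registry at 10.237.7.74:5000 speaks the plain-HTTP Docker Registry v2 API (adjust if yours uses TLS or authentication):

curl -s http://10.237.7.74:5000/v2/_catalog
curl -s http://10.237.7.74:5000/v2/prometheus/node-exporter/tags/list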

Verify the image URLs after updating:

shell> ceph config get mgr
WHO     MASK  LEVEL     OPTION                                     VALUE                                                                                               RO
global        advanced  cluster_network                            10.237.7.0/24                                                                                       *
global        basic     container_image                            10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc  *
mgr           advanced  mgr/cephadm/container_image_alertmanager   10.237.7.74:5000/prometheus/alertmanager:v0.25.0                                                    *
mgr           advanced  mgr/cephadm/container_image_grafana        10.237.7.74:5000/ceph/ceph-grafana:8.3.5                                                            *
mgr           advanced  mgr/cephadm/container_image_node_exporter  10.237.7.74:5000/prometheus/node-exporter:v1.5.0                                                    *
mgr           advanced  mgr/cephadm/container_image_prometheus     10.237.7.74:5000/prometheus/prometheus:v2.43.0                                                      *
mgr           advanced  mgr/cephadm/container_init                 True                                                                                                *
mgr           advanced  mgr/cephadm/migration_current              6                                                                                                   *
mgr           advanced  mgr/dashboard/ALERTMANAGER_API_HOST        http://CEPH-LAB-MON-071:9093                                                                        *
mgr           advanced  mgr/dashboard/GRAFANA_API_SSL_VERIFY       false                                                                                               *
mgr           advanced  mgr/dashboard/GRAFANA_API_URL              https://CEPH-LAB-MON-071:3000                                                                       *
mgr           advanced  mgr/dashboard/PROMETHEUS_API_HOST          http://CEPH-LAB-MON-071:9095                                                                        *
mgr           advanced  mgr/dashboard/ssl_server_port              8443                                                                                                *

Or, more concisely:

shell> ceph config dump | grep mgr/cephadm
mgr                            advanced  mgr/cephadm/container_image_alertmanager   10.237.7.74:5000/prometheus/alertmanager:v0.25.0                                                    *
mgr                            advanced  mgr/cephadm/container_image_grafana        10.237.7.74:5000/ceph/ceph-grafana:8.3.5                                                            *
mgr                            advanced  mgr/cephadm/container_image_node_exporter  10.237.7.74:5000/prometheus/node-exporter:v1.5.0                                                    *
mgr                            advanced  mgr/cephadm/container_image_prometheus     10.237.7.74:5000/prometheus/prometheus:v2.43.0                                                      *
mgr                            advanced  mgr/cephadm/container_init                 True                                                                                                *
mgr                            advanced  mgr/cephadm/migration_current              6                                                                                                   *

Then repeat the steps above if you want the change to take effect immediately, for example:

systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-071.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-072.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-MON-073.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-074.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-075.service
systemctl restart ceph-<ceph_fsid>@node-exporter.CEPH-LAB-OSD-076.service

Or:

shell> ceph orch rm node-exporter --force
Removed service node-exporter

shell> ceph orch apply node-exporter
Scheduled node-exporter update...
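Alternatively, the orchestrator can restart all daemons of the service in place, without removing the service spec:

ceph orch restart node-exporter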

Option 4: Check system resources

If the server is short on resources, the daemon may fail to start. Check memory with:

free -h

Check disk space:

df -h

Check CPU usage:

top

If resources are overloaded, consider optimizing the workload or upgrading the hardware.
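To check every cluster host at once instead of logging in one by one, a small loop over the orchestrator's host list can help. A minimal sketch, assuming jq and passwordless SSH from the admin node:

for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
  echo "== ${host} =="
  ssh "${host}" 'free -h | grep Mem; df -h / | tail -n 1'
done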

4. Preventing the error in the future

Configure daemon monitoring

Use Prometheus + Grafana to monitor daemon status:

ceph mgr module enable prometheus
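Once the module is enabled, you can confirm where the metrics endpoint is exposed:

ceph mgr services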

Enable automatic daemon recovery

You can enable daemon auto-recovery with:

ceph config set mgr mgr/cephadm_auto_repair true

Update Ceph regularly

If the error is caused by an outdated version, upgrade Ceph:

ceph orch upgrade start --ceph-version latest
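In practice, pass an explicit target release (or --image when pulling from a local registry) rather than latest, and monitor progress with:

ceph orch upgrade status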

5. Conclusion

The CEPHADM_FAILED_DAEMON error is one of the most common issues in Ceph when using cephadm to manage services. Identifying the cause and fixing it promptly keeps the cluster running reliably. I hope this article has helped you understand the error and how to fix it effectively!
