Last modified February 7, 2021
This site has been automatically translated with Google Translate from the original page written in French; there may be some translation errors.
The smartmontools package is built around two tools: the smartd daemon, which is better suited to servers and machines that run 24/7, and the smartctl tool, which can be run on demand or from a script on a PC that is only switched on intermittently. On my Mageia system I installed the default package, which can be found on any modern distribution.
Both tools let you run tests, which are the only way to retrieve information on how the disk is operating. There are two types of tests: short and long (extended).
There are two schools of thought on disk tests. One recommends running tests to anticipate failures, even though in practice SMART only detects 30% of them. The other holds that the tests are aggressive and contribute to damaging the disks, which reduces their lifespan and so increases the risk of failure! For my part, I chose to use SMART on my server.
In the rest of this page, I have taken as an example my Dell PowerEdge T310 server which has a configuration based on a Dell PERC 6/i RAID controller with:
On the USB port I have a 4TB SATA disk for data backup.
The smartd daemon monitors the hard drives in the background and sends alerts when errors occur. To configure it, edit the file /etc/smartd.conf; here is its content:
DEVICESCAN -o on -S on -s (S/../.././01|L/../../1/03) -m olivier -M exec /usr/bin/mail
This means that all disks undergo a short test every day at 1am and a long test every Monday at 3am; if errors are found, an email is sent to the user olivier. In detail, option by option:
DEVICESCAN applies the directives to all detected disks. You can also name a specific disk instead, for example /dev/sda -a for a SATA disk with the special file /dev/sda. The smartctl --scan command will help you find the identifier of each detected disk; here is the result on my server:
-o on applies only to ATA/SATA drives; it enables SMART Automatic Offline Testing, which instructs the drive to update its SMART operating data every 4 hours.
-S on enables Attribute Autosave, which saves operating data such as error counters and power-on time so that they are not reset to zero every time the drive is turned off and on again. This assumes that your drive has internal memory to store this data.
-s runs the tests at scheduled times. It takes a regular expression that is matched against strings of the form T/MM/DD/d/HH, where T is the test type (L for long, S for short), MM the month, DD the day of the month, d the day of the week (1 for Monday to 7 for Sunday) and HH the hour.
In this case, S/../.././01 corresponds to a short test every day at 1am and L/../../1/03 to a long test every Monday at 3am. Joining them as (S/../.././01|L/../../1/03), with a | (or) between the two, combines both schedules: a test is started whenever either pattern matches.
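To make the schedule syntax concrete, here is a minimal Python sketch that checks which tests the expression above would trigger at a given date and time. It is only an approximation of what smartd does: Python's re module stands in for the POSIX regular expressions used by smartd, and the function name due_tests is hypothetical.

```python
import re
from datetime import datetime

# The schedule taken from the smartd.conf line above.
SCHEDULE = r"(S/../.././01|L/../../1/03)"

def due_tests(when, schedule=SCHEDULE):
    """Return the test types whose T/MM/DD/d/HH string matches the schedule."""
    due = []
    for test_type in "LSCO":  # long, short, conveyance, offline
        # isoweekday() returns 1 for Monday to 7 for Sunday, like the d field.
        candidate = (
            f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        )
        if re.fullmatch(schedule, candidate):
            due.append(test_type)
    return due

print(due_tests(datetime(2020, 11, 9, 3)))   # a Monday at 3am
print(due_tests(datetime(2020, 11, 10, 1)))  # a Tuesday at 1am
```

On a Monday at 3am only the long test matches, and at 1am on any day only the short test matches, which is the behaviour described above.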
For more information on the file syntax, see the smartd.conf man page.
Start smartd by typing systemctl start smartd (or restart it with systemctl restart smartd); journalctl -f then gives:
Nov 08 19:29:05 mana.kervao.fr smartd[16392]: smartd 7.0 2018-12-30 r4883 [x86_64-linux-5.6.6-server-1.mga7] (local build)

This message was generated by the smartd daemon running on:

   host name:  mana
   DNS domain: kervao.fr

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/sdc [SAT]

Device info:
ST4000DM004-2CV104, S/N:WFN0VFTR, WWN:5-000c50-0be69c167, FW:0001, 4.00 TB

For details see host's SYSLOG.
On PCs that are not permanently on, on the other hand, you can run the disk health check commands from time to time. Taking as an example my internal disk, identified by the special file /dev/sdc, first type the command smartctl -t short /dev/sdc; here is the result:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.7.19-desktop-3.mga7] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Nov 9 18:48:33 2020
Use smartctl -X to abort test.
Then two minutes later we type smartctl -l selftest /dev/sdc and here is the result:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.7.19-desktop-3.mga7] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       326         -
# 2  Extended offline    Completed without error       00%        17         -
Everything is fine: no errors were found.
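If you run this check from a script, the self-test log can be read automatically rather than by eye. The following Python sketch parses the rows of the log shown above and lists the tests that did not complete without error. The function names and the regular expression are my own assumptions based on the layout of this output, not something provided by smartmontools.

```python
import re

# Self-test log rows as printed by smartctl -l selftest (copied from above).
SAMPLE = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       326         -
# 2  Extended offline    Completed without error       00%        17         -
"""

# One log row: "# N", description, status, remaining %, lifetime hours, LBA.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)

def parse_selftest_log(text):
    """Return one dict per self-test row found in the text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows

def failed_tests(rows):
    """Keep only the rows whose status is not a clean completion."""
    return [r for r in rows if r["status"] != "Completed without error"]

rows = parse_selftest_log(SAMPLE)
print(len(rows), "tests logged,", len(failed_tests(rows)), "with problems")
```

In a real script the text would come from the output of smartctl -l selftest instead of the SAMPLE string.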
To automate all this, I suggest you use anacron .
For a disk connected via USB (through an adapter or an enclosure), on the other hand, you have to give a device type after -d, chosen from sat, usbsunplus, usbcypress and usbjmicron. In practice, run lsusb to find your USB-connected hard disk, for example with a SATA/USB adapter:
Bus 002 Device 010: ID 7825:a2a4 ULT-Best Best USB Device
Now look up your adapter or enclosure in the list of supported USB devices on www.smartmontools.org. Be careful, not all devices are listed, so you may have to try each of the device types above in turn. In my case it was -d sat for both the SATA/USB adapter and the USB enclosure. The command for a short test will therefore be (if the special file is /dev/sde):
smartctl -t short -d sat /dev/sde
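When the adapter is not listed, trying each device type by hand gets tedious. The sketch below simply builds one command per candidate type so that they can be tried in turn. Using smartctl -i (which only prints identity information) rather than starting a test is my own choice here, and the function name is hypothetical.

```python
import shlex

# Device types named in the text for USB adapters and enclosures.
USB_TYPES = ["sat", "usbsunplus", "usbcypress", "usbjmicron"]

def candidate_commands(device, types=USB_TYPES):
    """Build one 'smartctl -i' command line per candidate -d type."""
    return [["smartctl", "-i", "-d", dev_type, device] for dev_type in types]

# Print the commands; run them by hand (as root) until one reports the disk.
for argv in candidate_commands("/dev/sde"):
    print(shlex.join(argv))
```

Each list can also be passed as is to subprocess.run if you prefer to let the script try them itself.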
Now let's move on to the long test. For a disk in my server's RAID 5 array (the 7th on the chain), type:
smartctl -t long -d megaraid,7 -a /dev/sdb
here is the result
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.6.6-server-1.mga7] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST3000VN007-2AH16M
Serial Number:    ZGY6KHBC
LU WWN Device Id: 5 000c50 0c47c70df
Firmware Version: SC60
User Capacity:    3 000 592 982 016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Nov 21 09:36:58 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 601) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 525) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       78583125
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   085   060   045    Pre-fail  Always       -       345234219
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       6111 (71 225 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295032835
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   082   073   040    Old_age   Always       -       18 (Min/Max 15/25)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   075   075   000    Old_age   Always       -       50405
194 Temperature_Celsius     0x0022   018   040   000    Old_age   Always       -       18 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   193   193   000    Old_age   Always       -       25
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1894 (157 172 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       17183414698
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       6251948358

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6111         -
# 2  Short offline       Completed without error       00%      6102         -
# 3  Short offline       Completed without error       00%      6078         -
# 4  Short offline       Completed without error       00%      6054         -
# 5  Short offline       Completed without error       00%      6030         -
# 6  Short offline       Completed without error       00%      6006         -
# 7  Extended offline    Completed without error       00%      5990         -
# 8  Short offline       Completed without error       00%      5982         -
# 9  Short offline       Completed without error       00%      5958         -
#10  Short offline       Completed without error       00%      5934         -
#11  Short offline       Completed without error       00%      5910         -
#12  Short offline       Completed without error       00%      5886         -
#13  Short offline       Completed without error       00%      5862         -
#14  Short offline       Completed without error       00%      5838         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 525 minutes for test to complete.
Test will complete after Sat Nov 21 18:21:59 2020
Use smartctl -X to abort test.
In short, the test will last 525 minutes (almost 9 hours for a 3TB disk!). To follow its progress, you will have to type regularly:
smartctl -a -d megaraid,7 /dev/sdb
here is the result
Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining.
In the system log file we also see progress indications
Nov 21 14:23:04 mana.kervao.fr smartd[6132]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], self-test in progress, 20% remaining
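The percentage remaining can be turned into a rough time estimate. The Python sketch below extracts the figure from either of the two outputs above and multiplies it by the 525 minutes announced by smartctl. It assumes the test progresses linearly, which is only an approximation, and the function names are hypothetical.

```python
import re

def remaining_percent(text):
    """Extract the 'NN% remaining' figure from smartctl or smartd output."""
    match = re.search(r"(\d+)% (?:of test )?remaining", text)
    return int(match.group(1)) if match else None

def remaining_minutes(text, total_minutes):
    """Estimate the minutes left, assuming linear progress (an approximation)."""
    percent = remaining_percent(text)
    if percent is None:
        return None
    return total_minutes * percent / 100

# The two progress reports quoted above.
STATUS = """Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining."""
LOG = ("Device: /dev/bus/0 [megaraid_disk_07] [SAT], "
       "self-test in progress, 20% remaining")

for report in (STATUS, LOG):
    print(remaining_minutes(report, 525), "minutes left (estimate)")
```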
Now here is what it looks like with a disk that has lots of errors; it is an old disk that used to be in my RAID 5 array.
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.7.19-desktop-3.mga7] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1ER164
Serial Number:    Z4Z2CEHK
LU WWN Device Id: 5 000c50 07ac2d868
Firmware Version: CC25
User Capacity:    2 000 398 934 016 bytes [2,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 22 19:08:14 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  80) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 208) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   094   006    Pre-fail  Always       -       138697940
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   065   060   030    Pre-fail  Always       -       3660043
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       36617
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       69
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   080   066   045    Old_age   Always       -       20 (Min/Max 20/20)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   007   007   000    Old_age   Always       -       187614
194 Temperature_Celsius     0x0022   020   040   000    Old_age   Always       -       20 (0 7 0 0 0)
197 Current_Pending_Sector  0x0012   099   099   000    Old_age   Always       -       272
198 Offline_Uncorrectable   0x0010   099   099   000    Old_age   Offline      -       272
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       20860h+08m+08.480s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2621623756
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       7556213911

SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49,710 days.

Error 12 occurred at disk power-on lifetime: 34776 hours (1449 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 6e ff ff ff 4f 00  34d+16:52:44.064  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:44.006  READ LOG EXT
  60 00 6f ff ff ff 4f 00  34d+16:52:39.564  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:39.506  READ LOG EXT
  60 00 70 ff ff ff 4f 00  34d+16:52:35.063  READ FPDMA QUEUED

Error 11 occurred at disk power-on lifetime: 34776 hours (1449 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 6f ff ff ff 4f 00  34d+16:52:39.564  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:39.506  READ LOG EXT
  60 00 70 ff ff ff 4f 00  34d+16:52:35.063  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:35.005  READ LOG EXT
  60 00 71 ff ff ff 4f 00  34d+16:52:30.563  READ FPDMA QUEUED

Error 10 occurred at disk power-on lifetime: 34776 hours (1449 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 70 ff ff ff 4f 00  34d+16:52:35.063  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:35.005  READ LOG EXT
  60 00 71 ff ff ff 4f 00  34d+16:52:30.563  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:30.505  READ LOG EXT
  60 00 72 ff ff ff 4f 00  34d+16:52:26.062  READ FPDMA QUEUED

Error 9 occurred at disk power-on lifetime: 34776 hours (1449 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 71 ff ff ff 4f 00  34d+16:52:30.563  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:30.505  READ LOG EXT
  60 00 72 ff ff ff 4f 00  34d+16:52:26.062  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:26.005  READ LOG EXT
  60 00 73 ff ff ff 4f 00  34d+16:52:21.562  READ FPDMA QUEUED

Error 8 occurred at disk power-on lifetime: 34776 hours (1449 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 72 ff ff ff 4f 00  34d+16:52:26.062  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:26.005  READ LOG EXT
  60 00 73 ff ff ff 4f 00  34d+16:52:21.562  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  34d+16:52:21.504  READ LOG EXT
  60 00 74 ff ff ff 4f 00  34d+16:52:17.061  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
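
Note that this disk still reports PASSED even though it has pending and uncorrectable sectors, so it is worth looking at the raw values yourself. The Python sketch below picks a few rows out of the attribute table above and flags those with a non-zero raw value. The choice of attributes to watch (5, 187, 197 and 198) is my own assumption, not a rule from smartmontools, and the function name is hypothetical.

```python
import re

# Attributes whose non-zero raw value is treated here as a warning sign.
# This selection is an assumption of the sketch, not a smartctl rule.
WATCHED = {5, 187, 197, 198}

# A few rows copied from the attribute table of the faulty disk above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   007   007   000    Old_age   Always       -       187614
197 Current_Pending_Sector  0x0012   099   099   000    Old_age   Always       -       272
198 Offline_Uncorrectable   0x0010   099   099   000    Old_age   Offline      -       272
"""

# ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, raw value.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-f]{4}\s+(?P<value>\d+)\s+"
    r"(?P<worst>\d+)\s+(?P<thresh>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

def suspicious_attributes(text):
    """Return {attribute name: raw value} for watched attributes above zero."""
    flagged = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        raw = int(match["raw"])
        if int(match["id"]) in WATCHED and raw > 0:
            flagged[match["name"]] = raw
    return flagged

print(suspicious_attributes(SAMPLE))
```

In a real script the text would be the output of smartctl -A (or -a) for the disk concerned.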