Saturday, May 13, 2006

SMART monitoring on Mac OS X

Unfortunately, I've recently had reason to want to check the SMART status of my PowerBook's internal disk.

Apple's Disk Utility gives you a simple pass/fail summary of the disk's SMART status, but it won't go into any detail, show you logs, or give the specifics of particular errors. What's more, I believe that it only looks at certain classes of errors to determine the Verified or Failing status, so if the drive is experiencing other problems, such as its firmware silently remapping sectors, it's still shown as A-OK.

I decided to go hunting for some more useful utilities, and first discovered SMARTReporter, which displays an icon in the menu bar showing the status of the disk. This utility, however, appears to use the same criteria as Disk Utility for determining a simple pass/fail status, so in this particular case it isn't that useful to me. It is handy in a more general sense, as it can monitor multiple disks, and upon a SMART failure it can do any combination of: pop up an alert dialog, execute another program, or send emails to multiple addresses.

So, digging a bit deeper, I then found smartmontools, which consists of smartctl and smartd. Together they report a whole lot of low-level SMART information, including the drive's error log.

It shows me that although Disk Utility et al think that my disk is OK, it has actually logged 2732 errors over its lifetime. It also reports the attribute values and thresholds the drive is recording, as shown below:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   096   096   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       1356
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2361
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       695
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       4727
 10 Spin_Retry_Count        0x0033   147   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2268
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   055   055   000    Old_age   Always       -       456326
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       44 (Lifetime Min/Max 11/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       291
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       14
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       8343
222 Loaded_Hours            0x0032   094   094   000    Old_age   Always       -       2636
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       222
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0
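
If you want to keep an eye on just the handful of attributes that relate to bad sectors, the table above is easy enough to pick apart with a script. What follows is only a rough Python sketch, not anything that ships with smartmontools: the function name `parse_attributes` and the `WATCHED` set are my own inventions, and it assumes the ten-column layout shown above, which other drives or smartctl versions may not share.

```python
# Hypothetical helper: parse rows of the attribute table printed by smartctl.
# Assumes the ten-column layout shown in this post; other layouts will be skipped.

WATCHED = {5, 196, 197, 198}  # sector-health attribute IDs (an illustrative choice)

def parse_attributes(text):
    """Return {id: dict} for every row that starts with a numeric attribute ID."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value containing spaces stays whole.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header lines and anything that is not an attribute row
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": fields[9].strip(),
        }
    return attrs

# A few rows copied from the output above.
sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 695
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 44 (Lifetime Min/Max 11/55)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 291
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 14
"""

attrs = parse_attributes(sample)
for attr_id in sorted(WATCHED & set(attrs)):
    print(attr_id, attrs[attr_id]["name"], attrs[attr_id]["raw"])
```

Splitting on whitespace rather than fixed column positions means the same code copes with the output whether or not the column alignment survived a copy and paste.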

It also shows the most recent five errors, one of which looks like:

Error 32 occurred at disk power-on lifetime: 4727 hours (196 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 f0 b8 bd 81 46 Error: IDNF 240 sectors at LBA = 0x0681bdb8 = 109166008

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 f8 b0 bb 81 40 00 09:40:17.577 READ DMA EXT
35 00 50 e3 27 04 40 00 09:40:17.577 WRITE DMA EXT
35 00 08 b8 c5 8f 40 00 09:40:16.283 WRITE DMA EXT
35 00 08 98 0d 90 40 00 09:40:16.283 WRITE DMA EXT
25 00 08 80 16 8a 40 00 09:40:14.891 READ DMA EXT
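
For anyone wondering how smartctl gets from that row of register bytes to an LBA, it can be worked out by hand. The snippet below is just a worked example in Python using the values from the error above. The function names are made up for illustration, and the error-register bit names are the common ATA meanings as I understand them, so treat it as a sketch rather than a reference.

```python
# Worked example: rebuild the 28-bit LBA from the register dump in the error log.
# Register values come from the error shown above; function names are illustrative.

def lba28(sn, cl, ch, dh):
    """Combine Sector Number, Cylinder Low/High and the low nibble of Device/Head."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

def error_flags(er):
    """Name the bits set in the ATA error register (common bit meanings, assumed)."""
    names = {0x04: "ABRT", 0x10: "IDNF", 0x40: "UNC", 0x80: "ICRC"}
    return [name for bit, name in sorted(names.items()) if er & bit]

lba = lba28(sn=0xB8, cl=0xBD, ch=0x81, dh=0x46)
print(hex(lba), lba)        # 0x681bdb8 109166008, matching the LBA smartctl printed
print(error_flags(0x10))    # ['IDNF'], i.e. "ID not found"
print(divmod(4727, 24))     # (196, 23), i.e. 196 days + 23 hours
```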


So, even though my hard drive is actually logging read/write errors, Disk Utility thinks it's all peachy. The drive has been playing up a little bit this arvo - running OS X just all of a sudden got very s-l-o-w. CPU usage was minimal, there weren't hundreds of threads in the run queue, VM usage was acceptable, even disk I/Os were reporting low numbers, but anything that was disk-related (so just about everything) was crawling along. Switching tasks was quick, scrolling through windows was fine, but even at the terminal, whenever you entered a command, there was a big pause before it ran, and then afterwards again as it had to reload the shell, another huge pause.

I'm now trying to copy my home folder off the drive (as a 5GB FileVault sparse image) and then I will attempt recovery of other files, but it seems that the drive is on its way out.

Once I've got my FileVault image off, I'm going to shut the PowerBook down for the night to give it a break, and then try to make a disk image of the whole drive tomorrow morning. Wish me luck!

9 Comments:

At Saturday, May 20, 2006 4:54:00 AM , bj73 said...

I am running smartctl daily on 150,000 (Windows) clients. It really is a great tool if you have a script set up to periodically check your discs and report any errors or warnings.
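
A scripted check like this usually hangs off smartctl's exit status, which the smartctl man page describes as a bit mask. Here is a minimal Python sketch for decoding it. The function name and the message wording are my own paraphrase, not smartctl's, so check the bit meanings against the man page for the version you have installed.

```python
# Sketch: decode the smartctl exit status bit mask (bit 0 is the first entry).
# Messages are paraphrased from the man page; verify against your installed version.

EXIT_BITS = [
    "command line did not parse",
    "device open failed",
    "a SMART command to the disk failed",
    "SMART status reports DISK FAILING",
    "pre-fail attributes at or below threshold",
    "attributes were at or below threshold in the past",
    "device error log contains errors",
    "self-test log contains errors",
]

def decode_exit_status(status):
    """Return the list of conditions flagged in a smartctl exit status."""
    return [msg for bit, msg in enumerate(EXIT_BITS) if status & (1 << bit)]

print(decode_exit_status(0))    # nothing flagged
print(decode_exit_status(64))   # only the error-log bit is set
```

A drive like the one in the post, which passes the overall health check but has entries in its error log, is the kind of case where the higher bits are worth reporting rather than only testing for a zero or non-zero result.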

The raw values (last column) of "Reallocated Sector Count" and "Current Pending Sectors" are the most interesting when it comes to defective sectors. If the drive cannot read a sector, it will flag this sector as "pending". The sector can get off the pending list if a later attempt to read the data succeeds, however this is very unlikely. A pending sector will be reallocated when data is written to it --> Current Pending goes down by 1 and Reallocated Sector Count is increased. Reallocation means that any access to this sector is satisfied by accessing another sector from a spare area. The drive's firmware keeps a map of those reallocations. If your drive starts showing defective sectors or has a high number of reallocated sectors, disc access is slowed down, because the head must travel between the normal data area and the spare area for reallocations.

You should try to get your data off the drive as soon as possible.
Writing zeros (or anything) to the entire drive should force any pending sector to be reallocated. Though I can tell from experience that most drives that start reallocating sectors will shortly go down in a death spiral of pending sectors and reallocations. When the spare area is exhausted the attribute "Reallocated Sector Count" will fail, causing most SMART monitoring tools to ring the alarm bell. But beware, modern drives can compensate for up to 2500 bad sectors.
On the other hand some drives live happily for years with some hundred reallocated sectors.

On a side note: The "Value" column should be read as: Higher=Better. It is scaled from 1 to 100 or 1 to 253; that varies from vendor to vendor. The Thresh column indicates the lower limit. If Value falls below this number, the attribute is failing. Though only attributes marked as Pre-fail will force the overall status to fail. E.g. the Power_On_Hours attribute may fall below its threshold when the drive reaches its calculated end of life, though this will not stop the drive from operating properly for the next 3 years.
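
The Value/Thresh/Type rule described above boils down to a few lines. This is only a sketch in Python with a made-up function name. One assumption to flag: as far as I can tell from the smartctl man page, an attribute counts as failed when its normalized value is less than or equal to the threshold, so that is the comparison used here.

```python
# Sketch of the Value/Thresh/Type rule. Assumes "value <= thresh" means failed,
# which is how the smartctl man page words it; the function name is hypothetical.

def attribute_state(value, thresh, attr_type):
    """Classify one attribute from its normalized value, threshold and type."""
    if value > thresh:
        return "ok"
    if attr_type == "Pre-fail":
        return "failing (affects overall status)"
    return "failing (advisory only)"

# Reallocated_Sector_Ct from the post: raw count 695, yet still well above threshold.
print(attribute_state(100, 50, "Pre-fail"))   # ok
# Made-up numbers showing the two failing cases.
print(attribute_state(40, 50, "Pre-fail"))    # failing (affects overall status)
print(attribute_state(1, 1, "Old_age"))       # failing (advisory only)
```

The first call is the interesting one: the drive in the post had hundreds of remapped sectors and the normalized value had not moved off 100, which is why pass/fail tools saw nothing wrong.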

You may also want to issue a SMART self test command to the drive:
smartctl -t short [drive]

On my 733-Quicksilver at home, smartctl fails to operate on a hard drive connected to a PCI based IDE controller.

 
At Sunday, May 21, 2006 10:34:00 AM , kai said...

Yep - I agree, it is a great utility... I've now got smartd running on my laptop, and next up I need to set up mail so that I can receive email notifications, as I rarely get into Console.app to see what's going on unless there's something wrong...
I was unable to get smartd to launch via launchd/launchctl, as it forks immediately after launching and kills its parent process, and launchd really doesn't like that.
I'm also of the opinion that as soon as bad sectors start appearing then the drive is on the way out - I just wish that the OS was more responsive in warning me, rather than waiting till there were 800-900 bad sectors remapped before telling me there was something wrong. It would have allowed me a lot more time to get data off the drive, however as it was I was luckily able to recover everything important...

 
At Thursday, June 01, 2006 4:21:00 AM , Anonymous said...

Has anybody tried this on an Intel Mac? I am running it on my MacBook Pro and getting "Operation not supported by device":

system:~ user$ df
Filesystem 512-blocks Used Avail Capacity Mounted on
/dev/disk0s2 194437600 165513120 28412480 85% /
devfs 194 194 0 100% /dev
fdesc 2 2 0 100% /dev
[volfs] 1024 1024 0 100% /.vol
/dev/disk1s3 488134944 428427344 59707600 88% /Volumes/lacie2
automount -nsl [253] 0 0 0 100% /Network
automount -fstab [257] 0 0 0 100% /automount/Servers
automount -static [257] 0 0 0 100% /automount/static
system:~ user$ smartctl -a /dev/disk0s2
smartctl version 5.36 [powerpc-apple-darwin8.6.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Smartctl open device: /dev/disk0s2 failed: Operation not supported by device
system:~ user$ smartctl -a disk0s2
smartctl version 5.36 [powerpc-apple-darwin8.6.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Smartctl open device: disk0s2 failed: Operation not supported by device
system:~ user$ smartctl -a disk0
smartctl version 5.36 [powerpc-apple-darwin8.6.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Smartctl open device: disk0 failed: Operation not supported by device
system:~ user$ smartctl -a disk1
smartctl version 5.36 [powerpc-apple-darwin8.6.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Smartctl open device: disk1 failed: Operation not supported by device

 
At Thursday, June 01, 2006 8:00:00 AM , kai said...

You've got it a bit back to front.
smartctl -a /dev/disk0 will check the internal disk in your MacBook Pro. You're trying to check an individual partition (disk0s2), which doesn't make sense, because SMART applies to the whole drive rather than to a partition.
You also can't check FireWire disks with it. Even though the disk in the FireWire enclosure might support SMART (and probably does), the FireWire to ATA bridge that the disk sits behind doesn't support gathering SMART status information.
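
If you're scripting this from the output of df, the partition name needs trimming back to the whole disk first. A small Python sketch of that step follows. The function name is hypothetical, it only understands diskN and diskNsM style names, and prefixing /dev/ is simply the convention used in the command above.

```python
import re

def whole_disk(path):
    """Map a slice name such as /dev/disk0s2 to its parent disk, /dev/disk0."""
    match = re.fullmatch(r"(?:/dev/)?(r?disk\d+)(?:s\d+)*", path)
    if match is None:
        raise ValueError(f"not a diskN or diskNsM name: {path!r}")
    return "/dev/" + match.group(1)

print(whole_disk("/dev/disk0s2"))   # /dev/disk0
print(whole_disk("disk1"))          # /dev/disk1
```

This only fixes the name. It does nothing for a disk behind a FireWire bridge, which will still refuse SMART commands.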

 
At Sunday, June 04, 2006 11:06:00 AM , Anonymous said...

Unfortunately smartctl -a /dev/disk0 also produces "Operation not supported by device" on a MBP.

 
At Sunday, June 11, 2006 2:07:00 AM , Anonymous said...

Our Mac guy (Geoff Keating) reports that he is close to having smartmontools working on Mac OS X on Intel.

Bruce Allen

 
At Sunday, June 11, 2006 10:59:00 AM , kai said...

Fantastic!
I'm close to getting a MBP, so would love to have smartmontools happening on the Intel platform!

 
At Tuesday, November 28, 2006 5:13:00 AM , Anonymous said...

I am wondering if someone could make a nice interface and package it in DMG format to use?

 
At Tuesday, November 28, 2006 11:20:00 AM , kai said...

For a .pkg style installer, check out the comments on this Mac OS X Hints thread

 
