Self-Monitoring, Analysis and Reporting Technology

S.M.A.R.T. is a monitoring system built into computer hard disk drives (HDDs). Its job is to monitor and record various events and, from them, to predict drive failure. Newer versions of the system can not only predict errors but also, where possible, try to repair the errors they detect. Whether this prediction is accurate is the subject of lengthy debate. The truth is that a good monitoring and data collection system can often detect a drive failure ahead of time, but not always.

Hard disk failures fall into two basic groups:

Predictable
Some failures can be predicted. This applies, for example, to mechanical faults in which the drive fails to access data on the first attempt; each such event is recorded in the S.M.A.R.T. table. As a drive approaches collapse, these problems become more and more frequent, and that is the time to back up your data as quickly as possible.

Unpredictable
There are also failures that this technology cannot predict, for example an electronics failure, a devastating impact from a drop or a blow, or a failure caused by a voltage spike.

The history of this system goes back to 1992, when IBM began manufacturing drives with so-called “Predictive Failure Analysis” (PFA). The system could collect a few basic drive attributes and, when one of them crossed its threshold, send an error message to the host system. This became an ANSI standard, and today S.M.A.R.T. is present in the electronics of most hard drives manufactured after 1996.

If your hard drive supports S.M.A.R.T., you should see a S.M.A.R.T. status message after switching on the PC, while the BIOS is starting up. The system can print “threshold not exceeded” or “threshold exceeded”, or alternatively “drive OK” or “drive fail”. The message “threshold exceeded” means that the drive has crossed the permitted threshold and that its collapse can be expected in the near future. This is of course not always an accurate indicator, but when you see this message, do not hesitate and back up the data on the drive immediately. The BIOS often does not show these messages, either because they are covered by the motherboard logo (which can be turned off in the BIOS setup) or because the feature is disabled in the BIOS and has to be enabled.

You do not have to watch the BIOS messages to see this data, though. You can install free or paid software in the operating system and view all the collected data there. You will also learn some interesting details, such as how many hours your hard drive has “clocked up”, how many times it has been powered on, and so on.
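
The “threshold exceeded” decision described above can be sketched in a few lines of Python. This is only an illustration, not firmware code. It assumes the common convention that every attribute has a normalized value and a vendor-set threshold, and that an attribute counts as failed once its value drops to or below the threshold. The function name `smart_status` and the sample readings are hypothetical.

```python
def smart_status(attributes):
    """Return a BIOS-style status string for (name, value, threshold) tuples.

    Assumption: an attribute has failed when its normalized value is less
    than or equal to its threshold; a threshold of 0 is treated as
    "this attribute can never fail".
    """
    for name, value, threshold in attributes:
        if threshold > 0 and value <= threshold:
            return "threshold exceeded"
    return "threshold not exceeded"


# Hypothetical sample readings: (attribute name, normalized value, threshold)
healthy = [("Reallocated Sectors Count", 100, 36), ("Spin Retry Count", 100, 97)]
failing = [("Reallocated Sectors Count", 30, 36), ("Spin Retry Count", 100, 97)]

print(smart_status(healthy))
print(smart_status(failing))
```

A single failed attribute is enough to flip the overall status, which is why one bad reading justifies an immediate backup.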

Ubuntu 10.04 and later comes with a rather handy preinstalled program for viewing this data.
You will find it here: Main Menu -> System -> Administration -> Disk Utility -> SMART Data
Details of this procedure are in the Ubuntu guide listed among the sources at the end of this article.

On Windows you can use various free programs to monitor the state of your drive. The one I liked most is HDD Health, a simple and clear program that shows your drive's health as a percentage as well as all S.M.A.R.T. attributes. It can also run in the background and alert the user when the drive reaches critical values. Another suitable program is DiskCheckup, which is free for personal use. There is also SpeedFan, which is primarily meant for controlling fan speeds in the PC but, among other interesting things, can also read S.M.A.R.T. data and run hard disk tests.

The following table is taken from Wikipedia and describes the individual S.M.A.R.T. attributes.

 

Legend

Higher: Higher raw value is better
Lower: Lower raw value is better
Critical (red colored row): Potential indicators of imminent electromechanical failure

 

ID Hex Attribute name Better Description
01 0x01 Read Error Rate
Lower

(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.
02 0x02 Throughput Performance
Higher
Overall (general) throughput performance of a hard disk drive. If the value of this attribute is decreasing there is a high probability that there is a problem with the disk.
03 0x03 Spin-Up Time
Lower
Average time of spindle spin up (from zero RPM to fully operational [millisecs]).
04 0x04 Start/Stop Count
A tally of spindle start/stop cycles. The spindle turns on, and hence the count is increased, both when the hard disk is turned on after having been entirely off (disconnected from the power source) and when the hard disk returns from sleep mode.
05 0x05 Reallocated Sectors Count
Lower
Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks that sector as “reallocated” and transfers data to a special reserved area (spare area). This process is also known as remapping, and reallocated sectors are called “remaps”. The raw value normally represents a count of the bad sectors that have been found and remapped. Thus, the higher the attribute value, the more sectors the drive has had to reallocate. This allows a drive with bad sectors to continue operation. However, a drive which has had any reallocations at all is significantly more likely to fail in the near future. While primarily used as a metric of the life expectancy of the drive, this number also affects performance. As the count of reallocated sectors increases, the read/write speed tends to become worse because the drive head is forced to seek to the reserved area whenever a remap is accessed. A workaround which will preserve drive speed at the expense of capacity is to create a disk partition over the region which contains remaps and instruct the operating system not to use that partition.
06 0x06 Read Channel Margin
Margin of a channel while reading data. The function of this attribute is not specified.
07 0x07 Seek Error Rate
N/A
(Vendor specific raw value.) Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has different structure for different vendors and is often not meaningful as a decimal number.
08 0x08 Seek Time Performance
Higher
Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.
09 0x09 Power-On Hours (POH)
Lower
Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state.
10 0x0A Spin Retry Count
Lower
Count of retry of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
11 0x0B Recalibration Retries or Calibration Retry Count
Lower
This attribute indicates the count that recalibration was requested (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
12 0x0C Power Cycle Count
This attribute indicates the count of full hard disk power on/off cycles.
13 0x0D Soft Read Error Rate
Lower
Uncorrected read errors reported to the operating system.
183 0xB7 SATA Downshift Error Count
Western Digital and Samsung attribute.
184 0xB8 End-to-End error
Lower
This attribute is a part of HP’s SMART IV technology and it means that after transferring through the cache RAM data buffer the parity data between the host and the hard drive did not match.
185 0xB9 Head Stability
Western Digital attribute.
186 0xBA Induced Op-Vibration Detection
Western Digital attribute.
187 0xBB Reported Uncorrectable Errors
Lower
The count of errors that could not be recovered using hardware ECC (see attribute 195).
188 0xBC Command Timeout
Lower
The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero and if the value is far above zero, then most likely there will be some serious problems with power supply or an oxidized data cable.
189 0xBD High Fly Writes
Lower
HDD producers implement a Fly Height Monitor that attempts to provide additional protections for write operations by detecting when a recording head is flying outside its normal operating range. If an unsafe fly height condition is encountered, the write process is stopped, and the information is rewritten or reallocated to a safe region of the hard drive. This attribute indicates the count of these errors detected over the lifetime of the drive.

This feature is implemented in most modern Seagate drives and some of Western Digital’s drives, beginning with the WD Enterprise WDE18300 and WDE9180 Ultra2 SCSI hard drives, and will be included on all future WD Enterprise products.

190 0xBE Airflow Temperature (WDC)
Lower
Airflow temperature on Western Digital HDs (Same as temp. [C2], but current value is 50 less for some models. Marked as obsolete.)
190 0xBE Temperature Difference from 100
Higher
Value is equal to (100-temp. °C), allowing manufacturer to set a minimum threshold which corresponds to a maximum temperature.

(Seagate only?)
Seagate ST910021AS: Verified Present
Seagate ST9120823ASG: Verified Present under name “Airflow Temperature Cel” 2008-10-06
Seagate ST3802110A: Verified Present 2007-02-13
Seagate ST980825AS: Verified Present 2007-04-05
Seagate ST3320620AS: Verified Present 2007-04-23
Seagate ST3500641AS: Verified Present 2007-06-12
Seagate ST3250824AS: Verified Present 2007-08-07
Seagate ST3250620AS: Verified Present
Seagate ST31000340AS: Verified Present 2008-02-05
Seagate ST31000333AS: Verified Present 2008-11-24
Seagate ST3160211AS: Verified Present 2008-06-12
Seagate ST3320620AS: Verified Present 2008-06-12
Seagate ST3400620AS: Verified Present 2008-06-12
Seagate ST3750330AS: Verified present 2009-07-06
Seagate ST3500418AS: Verified present 2010-04-03
Seagate ST31500341AS: Verified present 2010-10-09
Samsung HD501LJ: Verified Present under name “Airflow Temperature” 2008-03-02
Samsung HD753LJ: Verified Present under name “Airflow Temperature” 2008-07-15

A note here: smartctl seems to interpret these correctly, at least in version 5.39.1:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 068 057 045 Old_age Always - 32 (Lifetime Min/Max 22/33)

Notice that the raw value is 32 (the correct airflow temperature in Celsius) and the normalized value is 100 - 32 = 68.

191 0xBF G-sense Error Rate
Lower
The count of errors resulting from externally-induced shock & vibration.
192 0xC0 Power-off Retract Count or Emergency Retract Cycle Count (Fujitsu)
Lower
Count of times the heads are loaded off the media. Heads can be unloaded without actually powering off.
193 0xC1 Load Cycle Count or Load/Unload Cycle Count (Fujitsu)
Lower
Count of load/unload cycles into head landing zone position.

The typical lifetime rating for laptop (2.5-in) hard drives is 300,000 to 600,000 load cycles. Some laptop drives are programmed to unload the heads whenever there has not been any activity for about five seconds. Many Linux installations write to the file system a few times a minute in the background. As a result, there may be 100 or more load cycles per hour, and the load cycle rating may be exceeded in less than a year.

194 0xC2 Temperature
Lower
Current internal temperature.
195 0xC3 Hardware ECC Recovered
N/A
(Vendor specific raw value.) The raw value has different structure for different vendors and is often not meaningful as a decimal number.
196 0xC4 Reallocation Event Count
Lower
Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful & unsuccessful attempts are counted.
197 0xC5 Current Pending Sector Count
Lower
Count of “unstable” sectors (waiting to be remapped, because of read errors). If an unstable sector is subsequently read successfully, this value is decreased and the sector is not remapped. Read errors on a sector will not remap the sector (since it might be readable later); instead, the drive firmware remembers that the sector needs to be remapped, and remaps it the next time it’s written.
198 0xC6 Uncorrectable Sector Count or Offline Uncorrectable or Off-Line Scan Uncorrectable Sector Count
Lower
The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
199 0xC7 UltraDMA CRC Error Count
Lower
The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).
200 0xC8 Multi-Zone Error Rate
Lower
The count of errors found when writing a sector. The higher the value, the worse the disk’s mechanical condition is.
200 0xC8 Write Error Rate (Fujitsu)
Lower
The total count of errors when writing a sector.
201 0xC9 Soft Read Error Rate or TA Counter Detected
Lower
Count of off-track errors.
202 0xCA Data Address Mark errors or TA Counter Increased
Lower
Count of Data Address Mark errors (or vendor-specific).
203 0xCB Run Out Cancel
Lower
Count of ECC errors
204 0xCC Soft ECC Correction
Lower
Count of errors corrected by software ECC
205 0xCD Thermal Asperity Rate (TAR)
Lower
Count of errors due to high temperature.
206 0xCE Flying Height
Height of heads above the disk surface. A flying height that’s too low increases the chances of a head crash while a flying height that’s too high increases the chances of a read/write error.
207 0xCF Spin High Current
Lower
Amount of surge current used to spin up the drive.
208 0xD0 Spin Buzz
Count of buzz routines needed to spin up the drive due to insufficient power.
209 0xD1 Offline Seek Performance
Drive’s seek performance during its internal tests.
210 0xD2 ?
Unknown attribute (found in Maxtor 6B200M0 200GB and Maxtor 2R015H1 15GB disks).
211 0xD3 Vibration During Write
Vibration during write.
212 0xD4 Shock During Write
Shock during write.
220 0xDC Disk Shift
Lower
Distance the disk has shifted relative to the spindle (usually due to shock or temperature). Unit of measure is unknown.
221 0xDD G-Sense Error Rate
Lower
The count of errors resulting from externally-induced shock & vibration.
222 0xDE Loaded Hours
Time spent operating under data load (movement of magnetic head armature).
223 0xDF Load/Unload Retry Count
Count of times head changes position.
224 0xE0 Load Friction
Lower
Resistance caused by friction in mechanical parts while operating.
225 0xE1 Load/Unload Cycle Count
Lower
Total count of load cycles
226 0xE2 Load ‘In’-time
Total time of loading on the magnetic heads actuator (time not spent in parking area).
227 0xE3 Torque Amplification Count
Lower
Count of attempts to compensate for platter speed variations
228 0xE4 Power-Off Retract Cycle
Lower
The count of times the magnetic armature was retracted automatically as a result of cutting power.
230 0xE6 GMR Head Amplitude
Amplitude of “thrashing” (distance of repetitive forward/reverse head motion).
231 0xE7 Temperature
Lower
Drive Temperature
232 0xE8 Endurance Remaining
Number of physical erase cycles completed on the drive as a percentage of the maximum physical erase cycles the drive is designed to endure.
232 0xE8 Available Reserved Space
Intel SSDs report the amount of available reserved space as a percentage of the reserved space in a brand new SSD.
233 0xE9 Power-On Hours
Number of hours elapsed in the power-on state.
233 0xE9 Media Wearout Indicator
Intel SSDs report a normalized value that starts at 100 (when the SSD is new) and declines to a minimum value of 1. It decreases while the NAND erase cycles increase from 0 to the maximum-rated cycles.
240 0xF0 Head Flying Hours
Time while head is positioning.
240 0xF0 Transfer Error Rate (Fujitsu)
Count of times the link is reset during a data transfer.
241 0xF1 Total LBAs Written
Total count of LBAs written.
242 0xF2 Total LBAs Read
Total count of LBAs read.
Some S.M.A.R.T. utilities will report a negative number for the raw value since in reality it has 48 bits rather than 32.
250 0xFA Read Error Retry Rate
Lower
Count of errors while reading from a disk
254 0xFE Free Fall Protection
Lower
Count of “Free Fall Events” detected
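
The smartctl note under attribute 190 can be checked mechanically. The following Python sketch splits one row of the smartctl attribute table into its columns and confirms that the normalized value equals 100 minus the raw temperature. The column order is taken from the header row quoted in that note. The function name `parse_attribute_line` is hypothetical, and the sample row is the one from the note with a plain ASCII `0x` and hyphen.

```python
def parse_attribute_line(line):
    """Split one row of the smartctl attribute table into named fields."""
    # The first nine columns are whitespace-separated; everything after
    # them belongs to RAW_VALUE, which may itself contain spaces.
    parts = line.split(None, 9)
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "flag": parts[2],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "type": parts[6],
        "updated": parts[7],
        "when_failed": parts[8],
        "raw": parts[9],
    }


sample = ("190 Airflow_Temperature_Cel 0x0022 068 057 045 "
          "Old_age Always - 32 (Lifetime Min/Max 22/33)")

row = parse_attribute_line(sample)
raw_temp = int(row["raw"].split()[0])  # leading number of the raw field

print(row["name"], "value:", row["value"], "raw temperature:", raw_temp)
print("value == 100 - raw:", row["value"] == 100 - raw_temp)
```

The same relation also explains the threshold column: a threshold of 45 on this scale corresponds to a maximum temperature of 55 °C.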

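The load cycle figures quoted under attribute 193 are easy to verify. The short Python sketch below works out how long a drive would take to reach its rated load cycle count at 100 cycles per hour. It assumes, purely for illustration, that the machine runs 24 hours a day; the function name `days_until_rating` is hypothetical.

```python
def days_until_rating(rated_cycles, cycles_per_hour, hours_per_day=24):
    """Days of operation until the rated load cycle count is reached."""
    return rated_cycles / (cycles_per_hour * hours_per_day)


# Rating range and cycle rate taken from the note on attribute 193.
for rating in (300_000, 600_000):
    days = days_until_rating(rating, cycles_per_hour=100)
    print(f"{rating} cycles at 100 per hour: {days:.0f} days")
```

Under this assumption the lower rating is used up after 125 days and the upper one after 250 days, both well under a year.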
 
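The remark about negative raw values for Total LBAs Written and Total LBAs Read comes down to integer width. The Python sketch below shows what happens when a value that needs more than 31 bits is read as a signed 32-bit integer. It illustrates the effect only and is not taken from any particular utility; the function name `as_signed_32` and the sample count are hypothetical.

```python
def as_signed_32(raw):
    """Interpret the low 32 bits of raw as a signed 32-bit integer."""
    low = raw & 0xFFFFFFFF
    # If bit 31 is set, a signed 32-bit reader sees a negative number.
    return low - 0x1_0000_0000 if low & 0x8000_0000 else low


raw_48 = 3_000_000_000  # hypothetical LBA count; fits easily in 48 bits

print("true value:       ", raw_48)
print("as signed 32-bit: ", as_signed_32(raw_48))
```

Reading the full 48-bit field as an unsigned number avoids the problem.
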
Sources:
http://en.wikipedia.org/wiki/S.M.A.R.T.
http://www.argusmonitor.com/en/smart.php
http://www.pc-king.co.uk/tips3.htm
http://karuppuswamy.com/wordpress/2010/05/19/how-to-predict-hard-disk-failure-in-ubuntu-with-3-clicks/

