Errors in Flags and descriptions of Smart Data

Post by **alan-0000** » 2012.06.15. 13:05

re
viewtopic.php?f=32&t=1524&st=0&sk=t&sd=a

After starting a long topic on this forum as above, followed by a further long topic elsewhere,
I believe there are defects on the Smart Data display for at least my OCZ SSD, and probably for all SSD, and perhaps HDD needs a review.

Attribute no. 1 :- Raw Read Error Rate.
In the bottom right corner you list the Flags as
Error-Rate, Performance, Statistical, Critical.
I am advised by the OCZ Support manager that such errors are NOT critical but normal for any SSD. see post #14 at
http://www.ocztechnologyforum.com/forum ... inutes-Why

As for HD Sentinel.. NO OCZ is not setting any critical flag, that threshold is programmed into HD Sentinel and it would be bad on a normal HDD, but it isn't on an SSD.

Attribute no. 195 :- Every one but OCZ seems to get this one wrong.
Both Hard Disk Sentinel 4.00 PRO and CrystalDiskInfo 4.3.0 describe this as
"On-the-fly ECC UNcorrectable Error Count"
HD Tune 2.55 Hard Disk Utility puts an opposite spin on this and calls it
"Hardware ECC Recovered"
The official OCZ TOOLBOX SMART READ DATA working to Sandforce firmware specifications call this
"ECC On-the-fly Count Normalized Rate"
According to Sandforce this simply shows that Error Correcting Code detected an error that HAD A NEED for correcting, but it indicates neither success nor failure of correction.
OCZ advise that correction failure is indicated by Attribute 187:
187: SSD Reported Uncorrectable Errors Uncorrectable RAISE errors reported to the host for all data access: 0

I have closely observed 3 instances, each lasting just over 4 minutes,
where creating a Macrium Reflect Partition Image backup file of the SSD, 11 GB of Used Space was compressed into a 6.5 GB backup file,
and each time as 11 GB was read, both 1:Raw Data and 195:"uncorrectable" increased by 25,000,000+.
I am guessing that each unit is not an erroneous 4 kB file cluster but a 64 bit word on my 64 bit system,
so 11 GB are read back with 200 MB of error.

Two percent of my operating system uses corrupt Raw Data and yet suffers no BSOD
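
The arithmetic behind this guess can be checked with a short Python sketch. It assumes, as the guess above does, that each count stands for one 8-byte (64-bit) word, and it uses decimal units (1 GB = 10^9 bytes); neither assumption is confirmed by OCZ or Sandforce.

```python
# Rough check of the estimate above; the unit size is a guess, not a documented fact.
COUNT_INCREASE = 25_000_000      # approximate rise in attributes 1 and 195 per backup
BYTES_PER_UNIT = 8               # assumption: one count = one 64-bit word
DATA_READ_BYTES = 11 * 10**9     # 11 GB of used space read (decimal GB)

suspect_bytes = COUNT_INCREASE * BYTES_PER_UNIT
fraction = suspect_bytes / DATA_READ_BYTES

print(f"{suspect_bytes / 10**6:.0f} MB flagged out of 11 GB read")
print(f"{fraction:.1%} of the data read")
```

Under these assumptions the result is 200 MB, a little under two percent of the data read.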

I really do believe that every Raw Data error has always been corrected, as shown by 187: value 0.

I suggest you remove the word UNcorrectable
You could follow "HD Tune 2.55 Hard Disk Utility" and use the word Recovered,
BUT ONLY if attribute 187: has not increased
N.B. Although 1: and 195: raw data values are always zero at startup,
and 187: is always zero at every time in my experience,
I understand from OCZ... posts #16 and #17 that if an error occurs and 187: increments it will never be zeroed by a power down or even a secure erase.

187 - you are correct. That point in time is from the 1st time the SSD was powered on, it should NEVER be reset by a power down or even a Secure erase.
It MAY be reset by a firmware update at some point, but up until now, none have required that level of "destructiveness".

N.B.
HDS and C.D.I. and HDTune show Raw Data values jumping from 0 up to 25,000,000+ as 11 GB is read.
The official OCZ ToolBox gadget shows a "Normalized Rate" that starts at 100 when no errors occurred, and rises up to 109 after Raw Data reaches 25,000,000.
I believe Sandforce have stipulated some bit-pattern to determine this,
and assume the possibility that the Sandforce Marketing department decided against frightening users such as myself with big Raw Data values like 25,000,000+

This is the official Toolbox report which was created when HDS and the two other tools were reporting error counts of 25,000,000+
ModelNumber : OCZ-VERTEX2
Serial Number : OCZ-96FXFXCDVTA602Q9
WWN : 5-e8-3a-97 f8a9391b0

Revision: 10
Attributes List
1: SSD Raw Read Error Rate Normalized Rate: 109 total ECC and RAISE errors
5: SSD Retired Block Count Reserve blocks remaining: 100%
9: SSD Power-On Hours Total hours power on: 764
12: SSD Power Cycle Count Count of power on/off cycles: 403
171: SSD Program Fail Count Total number of Flash program operation failures: 1
172: SSD Erase Fail Count Total number of Flash erase operation failures: 0
174: SSD Unexpected power loss count Total number of unexpected power loss: 14
177: SSD Wear Range Delta Delta between most-worn and least-worn Flash blocks: 0
181: SSD Program Fail Count Total number of Flash program operation failures: 1
182: SSD Erase Fail Count Total number of Flash erase operation failures: 0
187: SSD Reported Uncorrectable Errors Uncorrectable RAISE errors reported to the host for all data access: 0
194: SSD Temperature Monitoring Current: 30 High: 30 Low: 30
195: SSD ECC On-the-fly Count Normalized Rate: 109
196: SSD Reallocation Event Count Total number of reallocated Flash blocks: 0
231: SSD Life Left Approximate SDD life Remaining: 100%
241: SSD Lifetime writes from host lifetime writes 64 GB
242: SSD Lifetime reads from host lifetime reads 512 GB

PLEASE NOTE :-
HDS starts up with Windows and right now shows the system has read 100 MB
HDS SMART is showing that both 1: and 195: have Value = 100 and Worst = 99 and data = 307AD4
OCZ Toolbox reports
1: SSD Raw Read Error Rate Normalized Rate: 100 total ECC and RAISE errors
195: SSD ECC On-the-fly Count Normalized Rate: 100
I am guessing the normalized rate will reach 101 after the "data", which they do not reveal, has gone higher than 3249BA
(HDS value climbed whilst I launched Toolbox and typed results)

PLEASE DO NOT CONCEAL "data = 307AD4" etc.
I like all the data I can get,
my intention with this bug report is to ask that you refrain from saying that 25,000,000 errors were NOT corrected when 187:...0 shows they were corrected

Regards
Alan

Post by **hdsentinel** » 2012.06.15. 14:33

Dear Alan,

Thanks for your message and the information.

First of all, I need to confirm that the "flags" are provided by the SSD itself.
Hard Disk Sentinel reads this information from the SSD together with the S.M.A.R.T. attributes, so if "Critical" is displayed for any attribute, I can confirm it is designed by the manufacturer of the device - it is not a bug of Hard Disk Sentinel.
So even if tech support says this attribute is not critical (as it does not indicate critical information), it is marked as critical by design.
Excuse me for the confusion.
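
To illustrate how such flags are stored: each S.M.A.R.T. attribute entry carries a status flag word alongside its Value, Worst and Data fields, and tools simply decode its bits. The sketch below uses the bit names from smartmontools. The mapping to the labels shown in Hard Disk Sentinel (for example "Critical" for the pre-failure bit and "Statistical" for the online-collection bit) is an assumption, and the flag word 0x000F is a hypothetical example, not a reading taken from this SSD.

```python
# Bit positions as named in smartmontools; the labels in parentheses are assumed
# equivalents of what Hard Disk Sentinel displays, not confirmed by this thread.
FLAG_BITS = {
    0x01: "Pre-failure (Critical)",
    0x02: "Online collection (Statistical)",
    0x04: "Performance",
    0x08: "Error-Rate",
    0x10: "Event count",
    0x20: "Self-preserving",
}

def decode_flags(flag_word: int) -> list[str]:
    """Return the names of all flag bits set in an attribute flag word."""
    return [name for bit, name in FLAG_BITS.items() if flag_word & bit]

# Hypothetical flag word with the four lowest bits set.
print(decode_flags(0x000F))
```

A monitoring tool that works this way only reports which bits the device set; it does not decide on its own that an attribute is critical.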

I agree that different tools may use slightly different naming for the attributes.
This is why it is important to check the SSD controller type and use the proper naming for the device.

Also as I tried to explain in the other topic, the "Raw read error rate" attribute on this SSD does NOT count the errors.
Its increase during read operations is completely normal as it counts the number of operations since start up (as you see, the counter is reset to zero on power cycle by the SSD itself, not by Hard Disk Sentinel or any other software).
I agree that this would indicate that the attribute is not really critical (as its increase is expected), but the attribute is nevertheless marked as "critical" by the manufacturer. Hard Disk Sentinel only displays what it reads from the device and does not change or alter it, so if you feel it is a bug, you may report it to OCZ.

> Attribute no. 195 :- Every one but OCZ seems to get this one wrong.
> Both Hard Disk Sentinel 4.00 PRO and CrystalDiskInfo 4.3.0 describe this as
> "On-the-fly ECC UNcorrectable Error Count"
> HD Tune 2.55 Hard Disk Utility puts an opposite spin on this and calls it
> "Hardware ECC Recovered"
> The official OCZ TOOLBOX SMART READ DATA working to Sandforce firmware specifications call this
> "ECC On-the-fly Count Normalized Rate"

"Hardware ECC Recovered" naming is wrong for an SSD, it is only valid for hard disks.
"On-the-fly ECC UNcorrectable Error Count" is the proper name for the SSD.

As you can see, the name in OCZ Toolbox is very similar, except for the words "Normalized Rate".

> According to Sandforce this simply shows that Error Correcting Code detected an error
> that HAD A NEED for correcting but neither indicates success or failure of correction.

So it does not really indicate failure or problem conditions on this SSD.

> OCZ advise that correction failure is indicated by Attribute 187:
> 187: SSD Reported Uncorrectable Errors Uncorrectable RAISE errors reported to the host for all data access: 0

Yes, I completely agree. This attribute is very important in Hard Disk Sentinel for this SSD.

> I have closely observed 3 instances, each lasting just over 4 minutes,
> Two percent of my operating system uses corrupt Raw Data and yet suffers no BSOD

> I really do believe that every Raw Data error has always been corrected, as shown by 187: value 0.

Of course not: as I wrote, the 25+ million are NOT errors.

> I suggest you remove the word UNcorrectable
> You could follow "HD Tune 2.55 Hard Disk Utility" and use the word Recovered,

Excuse me, but HD Tune is wrong.

We usually have the most recent attribute descriptions and other tools follow us with more or less success. Sorry!

> N.B. Although 1: and 195: raw data values are always zero at startup,

Yes, these values are cleared on power-cycle (on/off cycle).

> and 187: is always zero at every time in my experience,

This confirms that your SSD is fine - there are no errors.

> HDS starts up with Windows and right now shows the system has read 100 MB

Yes, this is possible: right after startup, Hard Disk Sentinel immediately reads the read counters accumulated since the system started.

> HDS SMART is showing that both 1: and 195: have Value = 100 and Worst = 99 and data = 307AD4
> OCZ Toolbox reports
> 1: SSD Raw Read Error Rate Normalized Rate: 100 total ECC and RAISE errors
> 195: SSD ECC On-the-fly Count Normalized Rate: 100

Value in HDS = Normalized number in other tool.
The "Normalized" number is calculated somehow by the SSD (based on the raw "data"), but in most cases, we would need to examine the raw "data" field to get the proper count of errors. So if any tool ignores the raw "data" column and does not report it, it tries to "hide" the actual status.
This is especially true for tools made by manufacturers, as they prefer not to answer questions such as "why do the values x and y increase" (exactly the questions raised here).

> PLEASE DO NOT CONCEAL "data = 307AD4" etc.
> I like all the data I can get,
> my intention with this bug report is to ask that you refrain from saying that 25,000,000 errors were NOT corrected when 187:...0 shows they were corrected

Excuse me but I still can't see the bug.
Attribute 1 is called "Raw read error rate", NOT "Raw read error count". This means that the 25,000,000 is not a count of any kind of errors.

If you can use Report -> Send test report to developer option, I'd happily check the complete status of the SSD and may advise.

Post by **alan-0000** » 2012.06.15. 20:12

hdsentinel wrote:
> Dear Alan,
> I agree that this would indicate that the attribute is not really critical (as its increase is expected), but the attribute is nevertheless marked as "critical" by the manufacturer. Hard Disk Sentinel only displays what it reads from the device and does not change or alter it, so if you feel it is a bug, you may report it to OCZ.

Some of my queries to OCZ on previous topics received little information because it was proprietary data that was retained by Sandforce who supply the Firmware.
I suspect that Non Disclosure Agreements may be in force.
I do not expect any benefit from reporting to OCZ so I will accept that "critical" is meaningless and worry about something else.

> Excuse me but I still can't see the bug.
> Attribute 1 is called "Raw read error rate", NOT "Raw read error count". This means that the 25,000,000 is not a count of any kind of errors.

I ONLY thought attribute 1 MIGHT be important because you reported this as Critical.
I ALWAYS thought that Raw Data errors were totally harmless apart from needing a few more logic clock cycles to correct the errors,
UNLESS the errors are so bad they cannot be corrected.

Attribute 195 is the killer when it tells me in sshot-49.gif
On-the-fly ECC Uncorrectable Error Count 0 109 104 ... 000001832E53
and displays 25,374,291 as the base 10 equivalent.
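
The conversion can be reproduced in a few lines of Python, assuming the Data column is a plain hexadecimal number exactly as displayed (how the SSD packs its raw bytes internally is not documented in this thread):

```python
raw_hex = "000001832E53"        # Data column for attribute 195 in sshot-49.gif
raw_value = int(raw_hex, 16)    # interpret the string as a base-16 integer
print(f"{raw_value:,}")         # prints 25,374,291
```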

My focus is on Uncorrectable which told me a very bad thing happened 25,000,000+ times,
whereas in fact 25,000,000+ lumps of data needed correction and were in fact fully corrected.

I am happy to accept that an official industry standard for Smart attribute 195 title may be
"On-the-fly ECC UNcorrectable Error Count"
but think the official OCZ Toolbox name is more appropriate and less scary, i.e.
"ECC On-the-fly Count Normalized Rate"

SSHOT-49.gif was captured today immediately after Macrium Reflect imaged 11 GB of used space on the SSD
which was seen by HDS on Disk Performance as reading 24,906 MB
( I guess that each MB that is read from the SSD "LBA" is buffered on C:\ and read from the buffer whilst being compressed )

Regards
Alan

Post by **hdsentinel** » 2012.06.18. 12:31

Dear Alan,

Excuse me for the confusion.
In general, the attribute "Raw read error rate" is critical and marked as critical by design (even if OCZ support says different).

Let's investigate this attribute more closely in Hard Disk Sentinel:

1 Raw Read Error Rate Threshold = 50 Value = 100 Worst = 99 Data = 0000002CB11D
(on the image you posted on the different topic).

It means that when the Value (100) (which is called "Normalized Rate" in the other tool) drops below the Threshold (50), the SSD is considered to be failing in less than 24 hours and you should ask for a warranty replacement (if it is still within the warranty period).

The term "Raw read error rate" suggests that the Value (100) reflects the rate of errors compared to the total number of read operations, reflected in the "Data" column. As you may see, there is an entry called Worst = 99, showing that the lowest Value (the worst in this term) was 99 in the lifetime of the SSD - which is still far from the Threshold (50) defined by the manufacturer.
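
The comparison described above can be sketched in Python. This is only an illustration: the line format is the one quoted in this post, the function name `parse_attribute` is hypothetical, and the failure test follows the wording here ("drops below the Threshold"); other tools may treat a Value equal to the Threshold as failed too.

```python
import re

LINE = "1 Raw Read Error Rate Threshold = 50 Value = 100 Worst = 99 Data = 0000002CB11D"

def parse_attribute(line: str) -> dict:
    """Extract the numeric fields from one attribute line in the format quoted above."""
    m = re.search(
        r"Threshold = (\d+) Value = (\d+) Worst = (\d+) Data = ([0-9A-Fa-f]+)", line
    )
    if m is None:
        raise ValueError("unrecognised attribute line")
    threshold, value, worst = (int(g) for g in m.group(1, 2, 3))
    return {
        "threshold": threshold,
        "value": value,
        "worst": worst,
        "data": int(m.group(4), 16),  # raw Data column, hexadecimal
    }

attr = parse_attribute(LINE)
failing = attr["value"] < attr["threshold"]   # "drops below the Threshold"
print(attr, "failing:", failing)
print("margin at worst point:", attr["worst"] - attr["threshold"])
```

For the line above, the Value is 50 points above the Threshold, and even the Worst recorded Value leaves a margin of 49.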

So if OCZ says that this attribute is not critical, they simply do not know the meaning of it.

I can imagine that they would say that the "Data" column (which reflects the read operations, not errors) is not critical. They know that this may confuse users, and this is why they prefer not to display this information in their tools, hiding most of the valuable status information.

I completely agree: data errors are really important - but this is not reported in the Data column of "1 Raw Read Error Rate" attribute of this SSD.

The same is true for attribute 195.
The other tool displays "ECC On-the-fly Count Normalized Rate" because it shows only the "Value" field (which that tool calls Normalized Rate).
This can cause confusion, because the term "Normalized Rate" could be applied to all the other attributes too: the tool shows only the "Normalized Rate" field for them as well.
This is why using that tool is confusing: it is not consistent and displays limited information.

Personally I feel the official term (used in several other tools, not only in Hard Disk Sentinel), which is suggested by the official Sandforce specifications, is better than the interpretation used by OCZ.

Post by **alan-0000** » 2012.06.18. 14:02

Thanks

My focus is just on 195.

When I saw 25,000,000 UNCorrectable errors as the C:\ partition was imaged,
I understood that should I restore that image I would be stuck with 25,000,000 off erroneous 64 bit words because ECC had failed.

So far as OCZ and Sandforce are concerned 195 merely counts how many words had "raw errors" which merited correction by ECC,
and does not show whether ECC succeeded or failed.
OCZ and Sandforce only give a verdict in Attribute 187 in which is stated a non-resettable total of fatal errors which could NOT be corrected.

I have never seen any difference at all between the Raw Errors of attribute 1 and the UNCorrectable errors of attribute 195.
QUESTIONS :-
Is this peculiar to Sandforce SSD controllers ?
Or do alternative SSD's have large differences between these attributes ?

I have a sneaky suspicion that Sandforce made a tiny mistake in their firmware and the wrong number appears in the data number,
and Sandforce compensate by telling OCZ to block their toolkit from displaying an embarrassingly wrong number.

Regards
Alan

Post by **hdsentinel** » 2012.06.18. 14:32

This is similar on other SSDs with Sandforce controllers: for example, on an ADATA S599 (using the same SF 1200 series controller), attributes #1 and #195 change exactly the way you described.

I do not really think it is a firmware bug: they just want to convey that the "Value" field is the interesting one (it should reflect the problems for both the #1 and #195 attributes). As you wrote, I agree that they want to hide the total number of operations (i.e. the "Data" field), but this way they also hide valuable information that can be collected from the "Data" field, for example the lifetime data transfer, start/stop count, etc.

For both #1 and #195, we agree that the "Value" is the interesting part on this SSD.

Post by **alan-0000** » 2012.06.18. 19:20

Thanks

Agreed

Regards
Alan
