Crucial MX500 reports (presumably) false pending sectors

Post by **dr0** » 2019.08.31. 17:50

Hello. A few weeks ago I bought the 1TB version of the Crucial MX500 SSD. On the 10th day of use, Hard Disk Sentinel reported that the drive health had dropped to 99%. At first, I thought that it was probably due to S.M.A.R.T. attribute #202 (Percentage Of The Rated Lifetime Used) increasing its value, which is normal for SSDs as NAND gets worn out by writes. But that wasn't the case. When the first health drop occurred, the SSD had only 500GB of lifetime writes, and then it got even weirder when the drive health jumped back to 100%. I checked the log and saw this:

8/22/2019 1:21:07 AM,#197   Current Pending Sector Count  1 -> 0
8/22/2019 1:16:03 AM,#197   Current Pending Sector Count  0 -> 1

After that, the drive had 2 more events of Pending Sector going to 1 and then back to 0:

8/27/2019 7:59:30 PM,#197   Current Pending Sector Count  1 -> 0
8/27/2019 7:54:25 PM,#197   Current Pending Sector Count  0 -> 1
8/24/2019 10:53:10 PM,#197   Current Pending Sector Count  1 -> 0
8/24/2019 10:48:06 PM,#197   Current Pending Sector Count  0 -> 1
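
The log entries above can be paired up to see how long each pending-sector episode appeared to last. The following is a minimal sketch that assumes the log keeps the exact line layout shown in this post; the function name `pending_durations` is made up for illustration and is not part of Hard Disk Sentinel.

```python
import re
from datetime import datetime

# The six log lines quoted in this post, newest first as in the original.
LOG = """\
8/27/2019 7:59:30 PM,#197   Current Pending Sector Count  1 -> 0
8/27/2019 7:54:25 PM,#197   Current Pending Sector Count  0 -> 1
8/24/2019 10:53:10 PM,#197   Current Pending Sector Count  1 -> 0
8/24/2019 10:48:06 PM,#197   Current Pending Sector Count  0 -> 1
8/22/2019 1:21:07 AM,#197   Current Pending Sector Count  1 -> 0
8/22/2019 1:16:03 AM,#197   Current Pending Sector Count  0 -> 1
"""

CHANGE = re.compile(r"(\d+)\s*->\s*(\d+)\s*$")


def pending_durations(log_text):
    """Return (start_time, seconds) for every 0 -> N ... N -> 0 episode."""
    events = []
    for line in log_text.splitlines():
        match = CHANGE.search(line)
        if "#197" not in line or match is None:
            continue  # skip lines about other attributes
        stamp = line.split(",", 1)[0].strip()
        when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        events.append((when, int(match.group(1)), int(match.group(2))))

    events.sort()  # oldest first, so each rise is followed by its fall
    episodes, start = [], None
    for when, old, new in events:
        if old == 0 and new > 0:
            start = when
        elif new == 0 and start is not None:
            episodes.append((start, (when - start).total_seconds()))
            start = None
    return episodes


for start, seconds in pending_durations(LOG):
    print(start, f"{seconds:.0f} s")
```

Every episode comes out at roughly five minutes. That spacing may simply reflect how often the monitoring software polls the drive rather than the true duration of the event.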

After some googling, I learned that this is a weird quirk that plagues basically every MX500 SSD from Crucial: https://utcc.utoronto.ca/~cks/space/blo ... lakyErrors
Some people say that this is a firmware bug that unfortunately hasn't been fixed to this day.
If I had known this before I would probably have thought twice before purchasing this drive. But I got it for a really good price and this behavior doesn't seem to affect performance or data integrity or system stability in any way. Plus the drive comes with 5 years of warranty. So I decided not to return it, do regular backups and hope that someday Crucial will finally release a firmware that fixes this.

I was wondering whether Hard Disk Sentinel can be configured to ignore attribute #197 (Pending Sector Count) when calculating the health percentage, while still saving its changes to the log and showing them in the corresponding tab. The current situation doesn't seem to reflect the real condition:

Especially after I experimented with the program settings: the reported drive health dropped to 50% when I selected "Analyse vendor-specific values" as the health calculation method in Advanced Options:

Afterward, I switched back to the recommended method of health calculation but the health graph hasn't changed.
So, my question is: Is there a way to clear only the Health graph without losing other stats? And prevent the Health graph from changing in the future due to the Pending Sector going to 1 and then back to 0?

For now, I just set an offset of -1 for it, but I'm not sure how it will affect the Health graph and whether the changes still will be logged because I still would like to be able to read the log and see when exactly false pendings occurred.

Post by **hdsentinel** » 2019.09.03. 09:42

Thanks for your message.

As far as I know, the issue is related to the disk controller (motherboard chipset) driver that manages the SSD: with some motherboard/driver combinations, the SSD occasionally records a pending sector on certain events, usually on shutdown. If possible, I'd check whether an updated driver is available, as it may improve things.

Yes, when Hard Disk Sentinel detects a change in the attribute, it does what it should: it adds the change to the Log page, and the Health % value reflects the change too. When the health recovers, the graph for those particular days still shows the low value, as you can see, because it is designed to show the lowest health detected each day.

If you prefer, you can clear ALL statistics (including the Health % graph of all devices) any time if you completely close Hard Disk Sentinel by File -> Exit and delete the file HDSentinel.sta from the folder of the software. This clears all statistics. If you prefer to delete only some values of the appropriate device, it requires manual editing of the file, so please let me know, I can assist.

> For now, I just set an offset of -1 for it, but I'm not sure how it will affect the Health graph and whether the
> changes still will be logged because I still would like to be able to read the log and see when exactly false pendings occurred.

Yes, this is a good solution. Or you may uncheck the mark under the "Enable" column on the S.M.A.R.T. page, next to the "Offset" value. This way that particular attribute will not be used to determine health, so its changes will not affect the health %.

Post by **dr0** » 2019.09.03. 12:58

> If you prefer to delete only some values of the appropriate device, it requires manual editing of the file, so please let me know, I can assist.

I'd appreciate that as I really need your guidance here.
The drop to 50% somehow cleared itself after some time. Now I would like to only hide these 3 drops to 99%:

> Or you may uncheck the mark under the "Enable" column on the S.M.A.R.T. page, next to the "Offset" value. This way that particular attribute will not be used to determine health, so its changes will not affect the health %.

Which method is better, setting the offset to -1 or unchecking the attribute completely? I know that it only ever changes by 1, going up and then back down, and I want it not to affect the Health percentage while its changes are still recorded and displayed on the Log tab.

Post by **dr0** » 2019.09.04. 20:59

Can you elaborate further on the checkmarks on the S.M.A.R.T. tab? If, say, I removed the checkmark for attribute #197 and the attribute changed in the future, would the changes still be displayed as raw data on the S.M.A.R.T. tab? Or would the attribute be ignored altogether and always display 0 even if that is not the case? I mean, what exactly does that checkmark disable?

Post by **hdsentinel** » 2019.09.05. 10:34

The checkmark controls if the corresponding attribute should be used in determining the health value of the disk drive.
For example, if you un-mark the checkmark for attribute #197, then its possible change will not affect the health %.

The attribute with its current value is always displayed on the S.M.A.R.T. page and you can examine possible change on the graph below the attribute list, so you can see how it changes - just the change will not affect the health value.

Post by **Lucretius** » 2020.04.11. 15:54

This "Current Pending Sectors" (CPS) bug correlates perfectly with a more serious MX500 bug that causes premature death of the ssd. If you also analyze the "FTL Background NAND Page Writes" (FPW) SMART attribute, you'll see that CPS changes from 0 to 1 when the ssd firmware begins writing a HUGE amount of data -- a multiple of approximately 37000 NAND pages, approximately one GByte -- to the ssd NAND memory. (Presumably the firmware is moving data, and thus internally reading as much as it writes.) CPS changes back to 0 when the write burst completes. This writing is excessive and the excess unnecessarily consumes some of the finite number of NAND block erases that the ssd can endure, causing the Remaining Life to decrease faster than it ought to.

The problem is described in detail in a recent thread at the tomshardware.com forum: https://forums.tomshardware.com/thr...f ... n.3571220/ (If the thread's url isn't displayed properly here, google or search tomshardware for "Crucial MX500 500GB sata ssd Remaining Life decreasing fast despite few bytes being written".)

Another MX500 SMART attribute is "Host NAND Page Writes" (HPW). The ratio of FPW to HPW (plus 1) is called the Write Amplification Factor (WAF) and for the MX500 it's excessive. You should pay particular attention to your recent WAF -- the ratio of the increase of FPW to the increase of HPW (plus 1) over a recent period of time -- because the problem can grow worse after the ssd has been in service for a few months.
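
The definition above can be written out as a short calculation. This is only a sketch of the arithmetic as described in this post (WAF = 1 + FPW / HPW, with the "recent" variant using the increases between two readings). The function names and the counter values are hypothetical and are not measurements from any drive.

```python
def waf(host_pages, ftl_pages):
    """Write Amplification Factor as defined in the post: 1 + FPW / HPW."""
    if host_pages <= 0:
        raise ValueError("host page count must be positive")
    return 1 + ftl_pages / host_pages


def recent_waf(earlier, later):
    """WAF over the interval between two (hpw, fpw) counter readings."""
    delta_host = later[0] - earlier[0]
    delta_ftl = later[1] - earlier[1]
    return waf(delta_host, delta_ftl)


# Hypothetical readings: (Host NAND Page Writes, FTL Background NAND Page Writes)
reading_a = (1_000_000, 2_000_000)
reading_b = (1_100_000, 2_200_000)

print(recent_waf(reading_a, reading_b))  # 100000 host pages, 200000 FTL pages
```

Using the increases rather than the lifetime totals matters here because the lifetime average can hide a problem that only developed after some months of service.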

On my own MX500 ssd (500 GBytes), the recent WAF was 38.91 from Feb 6 2020 to Feb 22 2020 (when I started keeping a detailed log). That's outrageously high. The Average Block Erase Count (ABEC) SMART attribute was incrementing every day or two, even though my pc was writing to the ssd at an average rate of less than 100 kBytes/second. From Jan 15 to Feb 4, the ssd Remaining Life decreased 1% (from 94% to 93%) even though my pc wrote only 138 GB to the ssd.

People whose computers write a lot more to their MX500s than mine does may not notice that WAF is higher than it ought to be, ABEC increments faster than it ought to, and Remaining Life decreases faster than it ought to.

Software like HD Sentinel could be enhanced to detect and report excessive WAF. (I wrote my own .bat file to do this. It works by periodically executing the Smartmontools smartctl.exe tool to collect SMART data, analyzing the data, and appending the relevant SMART data and the WAF to a log file. Currently my pc runs two copies of the .bat: one that logs every two hours and the other daily. To observe the perfect correlation between the CPS bug and the FPW bursts, I logged every two seconds... it takes about 5 seconds for a 37000 page burst to complete.)
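
The logging approach described above can be sketched in Python instead of a .bat file. This is not the author's script. It assumes that `smartctl -A` prints the usual ten-column attribute table, and that on the MX500 attribute 247 holds the host page writes and 248 the FTL page writes; those IDs and the sample table below are assumptions to check against your own drive's output.

```python
import subprocess

# Assumed attribute IDs (verify against your own smartctl output).
CPS_ID, HPW_ID, FPW_ID = 197, 247, 248

# Illustrative table only; the values are invented, not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       100000
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       200000
"""


def read_smart_table(device):
    """Run `smartctl -A` on the device and return its text output."""
    done = subprocess.run(["smartctl", "-A", device],
                          capture_output=True, text=True, check=False)
    return done.stdout


def parse_raw_values(table_text):
    """Map attribute ID -> integer raw value for every table row."""
    raw = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and carry the raw value in column 10.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            raw[int(parts[0])] = int(parts[9])
    return raw


def log_line(timestamp, raw):
    """Format one CSV log record: time, CPS, HPW, FPW, lifetime WAF."""
    lifetime_waf = 1 + raw[FPW_ID] / raw[HPW_ID]
    return f"{timestamp},{raw[CPS_ID]},{raw[HPW_ID]},{raw[FPW_ID]},{lifetime_waf:.2f}"


print(log_line("2020-02-22 12:00:00", parse_raw_values(SAMPLE)))
```

On a real system, `parse_raw_values(read_smart_table("/dev/sda"))` would replace the sample table (the device path is a placeholder), and the resulting line could be appended to a log file on whatever schedule is wanted.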

Crucial's tech support did not admit it's a bug, but they agreed to replace the ssd with a new one since they couldn't explain why WAF was so high. Eventually I'll probably begin the replacement process (they won't ship the replacement until after they receive the defective unit, a hassle) but I expect the replacement ssd will have the same bug.

Unclear from the perfect correlation is the cause and effect: whether the write burst causes CPS to change to 1 as a weird side effect, or whether the write burst is triggered by whatever it is (supposedly an error reading a sector, according to the definition of CPS) that causes CPS to change briefly to 1.

One final comment: I discovered a way to tame the excessive WAF. It appears that an ssd selftest runs at a higher priority than the buggy firmware background process that writes the huge bursts. I wrote another .bat that causes the ssd to perform selftests nearly nonstop (19.5 minutes of every 20 minutes) and this has reduced the average WAF to less than 3. The write bursts occur only during the 30 second pause between selftests, so there are fewer of them. (The reason I don't run the selftests nonstop is because I don't know whether that would prevent some necessary low priority processes from getting enough runtime.) I estimate it costs about one watt to do the selftests; they prevent the ssd from entering its low power state (which can be observed by the effect on the Power On Hours SMART attribute). The selftests raise the average temperature of the ssd by about 5 degrees C, to about 40C which is acceptable. I haven't yet benchmarked the ssd speed to see whether the selftests interfere with performance (maybe not since selftests are presumably a lower priority process than host reads and writes), or whether the avoidance of the low power state enhances performance.
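
The duty cycle described above (19.5 minutes of selftest, then a 30 second pause) might be scripted along these lines. This is a sketch under assumptions, not the author's .bat: it assumes an extended selftest is started with `smartctl -t long` and aborted with `smartctl -X`, and that the selftest would otherwise run for at least 19.5 minutes on the drive in question. The function name and the injectable `run` and `sleep` parameters are made up for illustration.

```python
import subprocess
import time

TEST_SECONDS = 1170   # 19.5 minutes of selftest per cycle
PAUSE_SECONDS = 30    # gap left for the drive's own background work


def selftest_loop(device, cycles, run=subprocess.run, sleep=time.sleep):
    """Repeat: start an extended selftest, let it run, abort it, pause."""
    for _ in range(cycles):
        run(["smartctl", "-t", "long", device], check=False)  # start selftest
        sleep(TEST_SECONDS)
        run(["smartctl", "-X", device], check=False)          # abort selftest
        sleep(PAUSE_SECONDS)


if __name__ == "__main__":
    # Dry run: print the commands and waits instead of touching a drive.
    selftest_loop(
        "/dev/sda",  # placeholder device path
        cycles=1,
        run=lambda cmd, check=False: print("would run:", " ".join(cmd)),
        sleep=lambda seconds: print("would wait:", seconds, "s"),
    )
```

Passing `run` and `sleep` as parameters keeps the schedule testable without a drive attached; calling the function with its defaults would issue the real commands.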

Post by **Lucretius** » 2020.04.13. 05:38

This "Current Pending Sectors" (CPS) bug correlates perfectly with a more serious MX500 bug that causes premature death of the ssd. If you also analyze the "FTL Background NAND Page Writes" (FPW) SMART attribute, you'll see that CPS changes from 0 to 1 when the ssd firmware begins writing a HUGE amount of data -- a multiple of approximately 37000 NAND pages, approximately one GByte -- to the ssd NAND memory. (Presumably the firmware is moving data, and thus internally reading as much as it writes.) CPS changes back to 0 when the write burst completes. This writing is excessive and the excess unnecessarily consumes some of the finite number of NAND block erases that the ssd can endure, causing the Remaining Life to decrease faster than it ought to.

The problem is described in detail in a recent thread at the tomshardware.com forum: https://forums.tomshardware.com/threads ... n.3571220/ (If the thread's url isn't displayed completely, google or search tomshardware for "Crucial MX500 500GB sata ssd Remaining Life decreasing fast despite few bytes being written".)

Post by **sadifika** » 2020.10.18. 14:02

> Lucretius wrote: This "Current Pending Sectors" (CPS) bug correlates perfectly with a more serious MX500 bug that causes premature death of the ssd. If you also analyze the "FTL Background NAND Page Writes" (FPW) SMART attribute, you'll see that CPS changes from 0 to 1 when the ssd firmware begins writing a HUGE amount of data -- a multiple of approximately 37000 NAND pages, approximately one GByte -- to the ssd NAND memory. (Presumably the firmware is moving data, and thus internally reading as much as it writes.) CPS changes back to 0 when the write burst completes. This writing is excessive and the excess unnecessarily consumes some of the finite number of NAND block erases that the ssd can endure, causing the Remaining Life to decrease faster than it ought to.

The smartmontools developers recently decided to ignore attribute #197 on MX500 SSDs, but they don't seem to be aware of the serious issues you mention. Maybe you should contact them and explain (it's ticket 1227 on the smartmontools bug tracker).
