Server crash. UPDATE: New server ordered

webwit
Wild Duck

18 Jul 2017, 09:06

The server crashed this morning, sometime after 6:00 CET (the nightly backup had completed). I couldn't ssh into the server even after a restart. I used our hoster's rescue system, which boots over PXE and runs in the server's memory. Mounted the RAID disk, used fsck to repair it (it found various errors). Now it's up. Currently checking database tables for any corruption.

UPDATE 19 July 20:10 UTC:
The server and thus deskthority will be down from 22:25 UTC July 19th (00:25 CEST July 20th, 18:25 EDT July 19th, 15:25 PDT July 19th) for an estimated 30 to 45 minutes, for a health check of our hard drives. This is 2 hours and 15 minutes from now. See you on the other side of the event horizon!

Wodan
ISO Advocate

18 Jul 2017, 09:09

Did you SMART query the HDD drives? :(

webwit
Wild Duck

18 Jul 2017, 09:13

I only found 1 error in the database, in an unimportant visitor statistics table, which I repaired. There was also an error earlier in a wiki table, which caused the wiki db backup not to complete for the past four days. However, fsck seems to have repaired it.

Please let me know if you find any missing data or if other things don't work as they should.
Wodan wrote: Did you SMART query the HDD drives? :(
I don't even know what a SMART query is. :? I'm not much of a linux admin.

kbdfr
The Tiproman

18 Jul 2017, 10:02

webwit wrote: The server crashed this morning, sometime after 6:00 CET (the nightly backup had completed). I couldn't ssh into the server even after a restart. I used our hoster's rescue system, which boots over PXE and runs in the server's memory. Mounted the RAID disk, used fsck to repair it (it found various errors). Now it's up. Currently checking database tables for any corruption.
Thanks for fixing that,
even if I don't understand much of what you did :lol:

chuckdee

18 Jul 2017, 11:10

webwit wrote: I only found 1 error in the database, in an unimportant visitor statistics table, which I repaired. There was also an error earlier in a wiki table, which caused the wiki db backup not to complete for the past four days. However, fsck seems to have repaired it.

Please let me know if you find any missing data or if other things don't work as they should.
Wodan wrote: Did you SMART query the HDD drives? :(
I don't even know what a SMART query is. :? I'm not much of a linux admin.
https://en.wikipedia.org/wiki/S.M.A.R.T.

It allows you to predict whether a drive is in danger of failing. I don't know how to do it on Linux, though.

matt3o
-[°_°]-

18 Jul 2017, 11:12

webwit wrote: I don't even know what a SMART query is. :? I'm not much of a linux admin.
run:

smartctl -t short /dev/sdX

(where X is the drive)

the test will take a few minutes. If the test is successful, run

smartctl -t long /dev/sdX

this will take much longer.

If any of the above fails, copy the test result, send it to the host and ask for a replacement.

To check test status run:

smartctl -l selftest /dev/sdX
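
A minimal sketch of how that last check could be scripted, assuming the self-test log layout that smartctl 5.x prints (entry "# 1" is the most recent test, columns separated by two or more spaces). The function name latest_selftest_ok is made up for illustration. It reads the log on stdin, so it can be tried on a saved log without touching a drive.

```shell
#!/bin/sh
# latest_selftest_ok: hypothetical helper, not part of smartmontools.
# Reads the output of `smartctl -l selftest /dev/sdX` on stdin and exits 0
# only if the newest log entry (Num "# 1") says "Completed without error".
# Assumption: columns are separated by runs of two or more spaces.
latest_selftest_ok() {
    awk -F '  +' '
        /^# *1  / { found = 1; ok = ($3 ~ /^Completed without error/) }
        END       { exit ((found && ok) ? 0 : 1) }
    '
}
```

Possible usage: smartctl -l selftest /dev/sdX | latest_selftest_ok && echo "last self-test passed". A test that is still running, or one that ended with a read failure, makes the function exit non-zero.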

webwit
Wild Duck

18 Jul 2017, 11:30

Thanks. Short tests completed without error. Currently long testing first raid disk. "Please wait 316 minutes for test to complete" ...

Wodan
ISO Advocate

18 Jul 2017, 11:41

webwit wrote: Thanks. Short tests completed without error. Currently long testing first raid disk. "Please wait 316 minutes for test to complete" ...
Thanks very much for taking care of this!
Some Linux distributions ship a SMART check daemon that you can configure to run tests periodically and send you the new results.
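
The daemon that ships with smartmontools is smartd. A hedged sketch of what an /etc/smartd.conf entry for periodic tests might look like; the schedule and the mail address are placeholders, not anything taken from this server.

```
# /etc/smartd.conf (sketch; schedule and address are placeholders)
# DEVICESCAN  apply the directives to every disk smartd can find
# -a          monitor all SMART attributes and logs
# -s          self-test schedule: short test daily at 02:00,
#             long test on Saturdays at 03:00
# -m          mail warnings to this address
DEVICESCAN -a -s (S/../.././02|L/../../6/03) -m admin@example.com
```

On a RHEL/CentOS 6 style system the daemon would then typically be started with "service smartd start" and enabled at boot with "chkconfig smartd on".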

DanielT
Un petit village gaulois d'Armorique…

18 Jul 2017, 11:48

What make is the server? Depending on the manufacturer, there are some tools that can be used and are way better than the SMART stuff. This is what I do for a living, by the way :P

webwit
Wild Duck

18 Jul 2017, 12:02

DanielT wrote: What make is the server? Depending on the manufacturer, there are some tools that can be used and are way better than the SMART stuff. This is what I do for a living, by the way :P

Code:

>lshw 
server.deskthority.net
    description: Desktop Computer
    product: MS-7823 (To be filled by O.E.M.)
    vendor: MSI
    version: 1.0
    serial: To be filled by O.E.M.
    width: 64 bits
    capabilities: smbios-2.8 dmi-2.7 vsyscall64 vsyscall32
    configuration: administrator_password=disabled boot=normal chassis=desktop family=To be filled by O.E.M. frontpanel_password=disabled keyboard_password=disabled power-on_password=disabled sku=To be filled by O.E.M. uuid=00000000-0000-0000-0000-448A5BD4482E
  *-core
       description: Motherboard
       product: B85M-G43 (MS-7823)
       vendor: MSI
       physical id: 0
       version: 1.0
       serial: To be filled by O.E.M.
       slot: To be filled by O.E.M.
     *-firmware
          description: BIOS
          vendor: American Megatrends Inc.
          physical id: 0
          version: V3.14B3
          date: 06/23/2014
          size: 64KiB
          capacity: 15MiB
          capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
     *-cpu
          description: CPU
          product: Xeon (Fill By OEM)
          vendor: Intel Corp.
          vendor_id: GenuineIntel
          physical id: 3d
          bus info: cpu@0
          version: Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
          slot: SOCKET 0
          size: 3500MHz
          capacity: 3900MHz
          width: 64 bits
          clock: 100MHz
          capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm cpufreq
          configuration: cores=4 enabledcores=4 threads=8
        *-cache:0
             description: L2 cache
             physical id: 3e
             slot: CPU Internal L2
             size: 1MiB
             capacity: 1MiB
             capabilities: internal write-back unified
        *-cache:1
             description: L1 cache
             physical id: 3f
             slot: CPU Internal L1
             size: 256KiB
             capacity: 256KiB
             capabilities: internal write-back
        *-cache:2
             description: L3 cache
             physical id: 40
             slot: CPU Internal L3
             size: 8MiB
             capacity: 8MiB
             capabilities: internal write-back unified
     *-memory
          description: System Memory
          physical id: 41
          slot: System board or motherboard
          size: 32GiB
        *-bank:0
             description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
             product: CT102464BA160B.C16
             vendor: Conexant (Rockwell)
             physical id: 0
             serial: AE008BBA
             slot: ChannelA-DIMM0
             size: 8GiB
             width: 64 bits
             clock: 1600MHz (0.6ns)
        *-bank:1
             description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
             product: CT102464BA160B.C16
             vendor: Conexant (Rockwell)
             physical id: 1
             serial: A41163FD
             slot: ChannelA-DIMM1
             size: 8GiB
             width: 64 bits
             clock: 1600MHz (0.6ns)
        *-bank:2
             description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
             product: CT102464BA160B.C16
             vendor: Conexant (Rockwell)
             physical id: 2
             serial: A10FE015
             slot: ChannelB-DIMM0
             size: 8GiB
             width: 64 bits
             clock: 1600MHz (0.6ns)
        *-bank:3
             description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
             product: CT102464BA160B.C16
             vendor: Conexant (Rockwell)
             physical id: 3
             serial: AE008BB9
             slot: ChannelB-DIMM1
             size: 8GiB
             width: 64 bits
             clock: 1600MHz (0.6ns)
     *-pci
          description: Host bridge
          product: Xeon E3-1200 v3 Processor DRAM Controller
          vendor: Intel Corporation
          physical id: 100
          bus info: pci@0000:00:00.0
          version: 06
          width: 32 bits
          clock: 33MHz
        *-display UNCLAIMED
             description: VGA compatible controller
             product: Xeon E3-1200 v3 Processor Integrated Graphics Controller
             vendor: Intel Corporation
             physical id: 2
             bus info: pci@0000:00:02.0
             version: 06
             width: 64 bits
             clock: 33MHz
             capabilities: msi pm vga_controller bus_master cap_list
             configuration: latency=0
             resources: memory:f7800000-f7bfffff memory:e0000000-efffffff(prefetchable) ioport:f000(size=64)
        *-usb:0
             description: USB controller
             product: 8 Series/C220 Series Chipset Family USB xHCI
             vendor: Intel Corporation
             physical id: 14
             bus info: pci@0000:00:14.0
             version: 05
             width: 64 bits
             clock: 33MHz
             capabilities: pm msi xhci bus_master cap_list
             configuration: driver=xhci_hcd latency=0
             resources: irq:33 memory:f7d00000-f7d0ffff
           *-usbhost:0
                product: xHCI Host Controller
                vendor: Linux 2.6.32-696.3.2.el6.x86_64 xhci_hcd
                physical id: 0
                bus info: usb@4
                logical name: usb4
                version: 2.06
                capabilities: usb-3.00
                configuration: driver=hub slots=6 speed=5000Mbit/s
           *-usbhost:1
                product: xHCI Host Controller
                vendor: Linux 2.6.32-696.3.2.el6.x86_64 xhci_hcd
                physical id: 1
                bus info: usb@3
                logical name: usb3
                version: 2.06
                capabilities: usb-2.00
                configuration: driver=hub slots=12 speed=480Mbit/s
        *-communication UNCLAIMED
             description: Communication controller
             product: 8 Series/C220 Series Chipset Family MEI Controller #1
             vendor: Intel Corporation
             physical id: 16
             bus info: pci@0000:00:16.0
             version: 04
             width: 64 bits
             clock: 33MHz
             capabilities: pm msi bus_master cap_list
             configuration: latency=0
             resources: memory:f7d16000-f7d1600f
        *-usb:1
             description: USB controller
             product: 8 Series/C220 Series Chipset Family USB EHCI #2
             vendor: Intel Corporation
             physical id: 1a
             bus info: pci@0000:00:1a.0
             version: 05
             width: 32 bits
             clock: 33MHz
             capabilities: pm debug ehci bus_master cap_list
             configuration: driver=ehci_hcd latency=0
             resources: irq:20 memory:f7d14000-f7d143ff
           *-usbhost
                product: EHCI Host Controller
                vendor: Linux 2.6.32-696.3.2.el6.x86_64 ehci_hcd
                physical id: 1
                bus info: usb@1
                logical name: usb1
                version: 2.06
                capabilities: usb-2.00
                configuration: driver=hub slots=2 speed=480Mbit/s
              *-usb
                   description: USB hub
                   vendor: Intel Corp.
                   physical id: 1
                   bus info: usb@1:1
                   version: 0.05
                   capabilities: usb-2.00
                   configuration: driver=hub slots=6 speed=480Mbit/s
        *-pci:0
             description: PCI bridge
             product: 8 Series/C220 Series Chipset Family PCI Express Root Port #1
             vendor: Intel Corporation
             physical id: 1c
             bus info: pci@0000:00:1c.0
             version: d5
             width: 32 bits
             clock: 33MHz
             capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:31 ioport:2000(size=4096) memory:df200000-df3fffff ioport:df400000(size=2097152)
        *-pci:1
             description: PCI bridge
             product: 8 Series/C220 Series Chipset Family PCI Express Root Port #5
             vendor: Intel Corporation
             physical id: 1c.4
             bus info: pci@0000:00:1c.4
             version: d5
             width: 32 bits
             clock: 33MHz
             capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:32 ioport:e000(size=4096) memory:f7c00000-f7cfffff ioport:f0000000(size=1048576)
           *-network
                description: Ethernet interface
                product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
                vendor: Realtek Semiconductor Co., Ltd.
                physical id: 0
                bus info: pci@0000:02:00.0
                logical name: eth0
                version: 0c
                serial: 44:8a:5b:d4:48:2e
                size: 1Gbit/s
                capacity: 1Gbit/s
                width: 64 bits
                clock: 33MHz
                capabilities: pm msi pciexpress msix vpd bus_master cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
                configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=2.3LK-NAPI duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=136.243.20.197 latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
                resources: irq:35 ioport:e000(size=256) memory:f7c00000-f7c00fff memory:f0000000-f0003fff(prefetchable)
        *-usb:2
             description: USB controller
             product: 8 Series/C220 Series Chipset Family USB EHCI #1
             vendor: Intel Corporation
             physical id: 1d
             bus info: pci@0000:00:1d.0
             version: 05
             width: 32 bits
             clock: 33MHz
             capabilities: pm debug ehci bus_master cap_list
             configuration: driver=ehci_hcd latency=0
             resources: irq:23 memory:f7d13000-f7d133ff
           *-usbhost
                product: EHCI Host Controller
                vendor: Linux 2.6.32-696.3.2.el6.x86_64 ehci_hcd
                physical id: 1
                bus info: usb@2
                logical name: usb2
                version: 2.06
                capabilities: usb-2.00
                configuration: driver=hub slots=2 speed=480Mbit/s
              *-usb
                   description: USB hub
                   vendor: Intel Corp.
                   physical id: 1
                   bus info: usb@2:1
                   version: 0.05
                   capabilities: usb-2.00
                   configuration: driver=hub slots=6 speed=480Mbit/s
        *-isa
             description: ISA bridge
             product: B85 Express LPC Controller
             vendor: Intel Corporation
             physical id: 1f
             bus info: pci@0000:00:1f.0
             version: 05
             width: 32 bits
             clock: 33MHz
             capabilities: isa bus_master cap_list
             configuration: driver=lpc_ich latency=0
             resources: irq:0
        *-storage
             description: SATA controller
             product: 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode]
             vendor: Intel Corporation
             physical id: 1f.2
             bus info: pci@0000:00:1f.2
             logical name: scsi0
             logical name: scsi1
             version: 05
             width: 32 bits
             clock: 66MHz
             capabilities: storage msi pm ahci_1.0 bus_master cap_list emulated
             configuration: driver=ahci latency=0
             resources: irq:34 ioport:f0b0(size=8) ioport:f0a0(size=4) ioport:f090(size=8) ioport:f080(size=4) ioport:f060(size=32) memory:f7d12000-f7d127ff
           *-disk:0
                description: ATA Disk
                product: HGST HUS724020AL
                physical id: 0
                bus info: scsi@0:0.0.0
                logical name: /dev/sda
                version: AA70
                serial: PN1134P6JHVLRS
                size: 1863GiB (2TB)
                capabilities: partitioned partitioned:dos
                configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=00065bc0
              *-volume:0
                   description: Linux swap volume
                   physical id: 1
                   bus info: scsi@0:0.0.0,1
                   logical name: /dev/sda1
                   version: 1
                   serial: 804f02eb-215f-4002-aa3f-87fd72ea6f60
                   size: 15GiB
                   capacity: 16GiB
                   capabilities: primary multi swap initialized
                   configuration: filesystem=swap pagesize=4096
              *-volume:1
                   description: EXT3 volume
                   vendor: Linux
                   physical id: 2
                   bus info: scsi@0:0.0.0,2
                   logical name: /dev/sda2
                   version: 1.0
                   serial: f431e2a6-4fdc-4fab-8f06-d6df688fd284
                   size: 511MiB
                   capacity: 512MiB
                   capabilities: primary multi journaled extended_attributes recover ext3 ext2 initialized
                   configuration: created=2015-05-12 12:29:47 filesystem=ext3 lastmountpoint=/installimage.rMbiM/hdd/boot modified=2017-07-18 06:47:12 mounted=2017-07-18 06:47:12 state=clean
              *-volume:2
                   description: EXT4 volume
                   vendor: Linux
                   physical id: 3
                   bus info: scsi@0:0.0.0,3
                   logical name: /dev/sda3
                   version: 1.0
                   serial: eb9ef821-4f84-49a1-9c52-6aa817286fec
                   size: 1846GiB
                   capacity: 1846GiB
                   capabilities: primary multi journaled extended_attributes large_files huge_files dir_nlink recover extents ext4 ext2 initialized
                   configuration: created=2015-05-12 12:29:57 filesystem=ext4 lastmountpoint=/ modified=2017-07-18 06:46:09 mounted=2017-07-18 06:47:12 state=clean
           *-disk:1
                description: ATA Disk
                product: ST2000NM0033-9ZM
                vendor: Seagate
                physical id: 1
                bus info: scsi@1:0.0.0
                logical name: /dev/sdb
                version: SN03
                serial: Z1X0CRNR
                size: 1863GiB (2TB)
                capabilities: partitioned partitioned:dos
                configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=000b1474
              *-volume:0
                   description: Linux swap volume
                   physical id: 1
                   bus info: scsi@1:0.0.0,1
                   logical name: /dev/sdb1
                   version: 1
                   serial: 804f02eb-215f-4002-aa3f-87fd72ea6f60
                   size: 15GiB
                   capacity: 16GiB
                   capabilities: primary multi swap initialized
                   configuration: filesystem=swap pagesize=4096
              *-volume:1
                   description: EXT3 volume
                   vendor: Linux
                   physical id: 2
                   bus info: scsi@1:0.0.0,2
                   logical name: /dev/sdb2
                   version: 1.0
                   serial: f431e2a6-4fdc-4fab-8f06-d6df688fd284
                   size: 511MiB
                   capacity: 512MiB
                   capabilities: primary multi journaled extended_attributes recover ext3 ext2 initialized
                   configuration: created=2015-05-12 12:29:47 filesystem=ext3 lastmountpoint=/installimage.rMbiM/hdd/boot modified=2017-07-18 06:47:12 mounted=2017-07-18 06:47:12 state=clean
              *-volume:2
                   description: EXT4 volume
                   vendor: Linux
                   physical id: 3
                   bus info: scsi@1:0.0.0,3
                   logical name: /dev/sdb3
                   version: 1.0
                   serial: eb9ef821-4f84-49a1-9c52-6aa817286fec
                   size: 1846GiB
                   capacity: 1846GiB
                   capabilities: primary multi journaled extended_attributes large_files huge_files dir_nlink recover extents ext4 ext2 initialized
                   configuration: created=2015-05-12 12:29:57 filesystem=ext4 lastmountpoint=/ modified=2017-07-18 06:46:09 mounted=2017-07-18 06:47:12 state=clean
        *-serial UNCLAIMED
             description: SMBus
             product: 8 Series/C220 Series Chipset Family SMBus Controller
             vendor: Intel Corporation
             physical id: 1f.3
             bus info: pci@0000:00:1f.3
             version: 05
             width: 64 bits
             clock: 33MHz
             configuration: latency=0
             resources: memory:f7d11000-f7d110ff ioport:f040(size=32)
  *-power UNCLAIMED
       description: To Be Filled By O.E.M.
       product: To Be Filled By O.E.M.
       vendor: To Be Filled By O.E.M.
       physical id: 1
       version: To Be Filled By O.E.M.
       serial: To Be Filled By O.E.M.
       capacity: 32768mWh

DanielT
Un petit village gaulois d'Armorique…

18 Jul 2017, 12:29

Ok, with that configuration SMART is the only way.

webwit
Wild Duck

18 Jul 2017, 15:02

Doesn't look good; the database crashed again. The rest of the server was still up. I'll wait for the smartctl results.

matt3o
-[°_°]-

18 Jul 2017, 15:21

webwit wrote: Doesn't look good; the database crashed again. The rest of the server was still up. I'll wait for the smartctl results.
want me to check the DB config?

webwit
Wild Duck

18 Jul 2017, 15:30

I don't think it's the config, just HD corruption? But feel free to check. smartctl is still at 70% remaining...

wobbled

18 Jul 2017, 16:00

Might be worth checking things such as disk latency, which is much quicker than a SMART query and will generally tell you if your HDDs are on their way out.
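
One rough way to look at latency without installing anything, sketched under the assumption that the kernel exposes the usual /proc/diskstats layout (field 3 = device name, field 4 = reads completed, field 7 = milliseconds spent reading, field 8 = writes completed, field 11 = milliseconds spent writing). The helper name avg_latency_ms is made up. The figures are averages since boot, so they only hint at a problem; iostat -x from the sysstat package gives live numbers.

```shell
#!/bin/sh
# avg_latency_ms: hypothetical helper. Reads /proc/diskstats-formatted text on
# stdin and prints the average time per completed read and write (since boot)
# for the device named in $1.
avg_latency_ms() {
    awk -v dev="$1" '
        $3 == dev {
            r = ($4 > 0) ? $7 / $4 : 0    # ms spent reading / reads completed
            w = ($8 > 0) ? $11 / $8 : 0   # ms spent writing / writes completed
            printf "%s read=%.2fms write=%.2fms\n", dev, r, w
        }
    '
}
```

Possible usage: avg_latency_ms sda < /proc/diskstats, then the same for sdb. Two mirrored disks with very different averages would be worth a closer look.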

webwit
Wild Duck

18 Jul 2017, 16:11

If this happens again I might just move the entire thing to a fresh server. Last time I did that it was only one or two clicks with cPanel WHM, moving over website, db, mail, dns etc. And we'll get some newer hardware for the same price.

wobbled

18 Jul 2017, 16:16

webwit wrote: If this happens again I might just move the entire thing to a fresh server. Last time I did that it was only one or two clicks with cPanel WHM, moving over website, db, mail, dns etc. And we'll get some newer hardware for the same price.
If possible, go entirely solid state with a new server. HDDs are ticking time bombs, honestly.

webwit
Wild Duck

18 Jul 2017, 16:18

That's what we had last time. It crashed. Also, it was small compared to an HDD, and we need capacity.

matt3o
-[°_°]-

19 Jul 2017, 08:09

how did the test go?

webwit
Wild Duck

19 Jul 2017, 09:07

No errors on the first disk, now checking the second.

Edit: Second disk also without errors.

matt3o
-[°_°]-

19 Jul 2017, 15:11

webwit wrote: No errors on the first disk, now checking the second.

Edit: Second disk also without errors.
that is really weird. let me check the mysql config

webwit
Wild Duck

19 Jul 2017, 16:30

There's one strange thing. /dev/sda took a lot longer than /dev/sdb, like 12 hours vs a couple of hours, while these are (I think) identical disks.
With both sda and sdb, the progress (like "70% remaining") is shown by smartctl -c /dev/sdX. But smartctl -l selftest /dev/sdX only showed the test as in progress (with X% remaining) under Num # 1 for sdb, not for sda (see below).

Code:

root@server [~]# smartctl -c /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 247) Self-test routine in progress...
                                        70% of test remaining.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 316) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

root@server [~]# smartctl -l selftest /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22554         -
# 2  Short offline       Completed without error       00%     22542         -
# 3  Extended offline    Completed without error       00%      3393         -
# 4  Extended offline    Completed without error       00%      3316         -
# 5  Extended offline    Completed without error       00%        21         -
# 6  Extended offline    Completed without error       00%         4         -

root@server [~]#

matt3o
-[°_°]-

19 Jul 2017, 17:25

would you post "smartctl -a /dev/sdX" for both drives? honestly 12 hours seems way too much for a smartctl test.

I had a look at the mysql config, it can be improved but I don't see anything too bad
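
For comparing two drives, the full smartctl -a dump is long. A small sketch that pulls out a few commonly watched attributes (reallocated, pending and uncorrectable sectors, plus CRC errors), assuming the attribute table layout smartctl 5.x prints, with the ID in column 1, the name in column 2, the flag in column 3 and the raw value in column 10. smart_key_attrs is a hypothetical name.

```shell
#!/bin/sh
# smart_key_attrs: hypothetical helper. Reads `smartctl -A` (or -a) output on
# stdin and prints "ID NAME RAW_VALUE" for attributes 5, 197, 198 and 199.
# The check on column 3 (a hex flag such as 0x0033) keeps out other lines
# that merely start with one of those numbers.
smart_key_attrs() {
    awk '
        $1 ~ /^(5|197|198|199)$/ && $3 ~ /^0x/ { print $1, $2, $10 }
    '
}
```

Possible usage: for d in /dev/sda /dev/sdb; do echo "$d"; smartctl -A "$d" | smart_key_attrs; done. Non-zero raw values for these attributes are usually the first thing to mention to the host.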

webwit
Wild Duck

19 Jul 2017, 17:38

/dev/sda:

Code:

root@server [~]# smartctl -a /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HGST HUS724020ALA640
Serial Number:    PN1134P6JHVLRS
LU WWN Device Id: 5 000cca 22de36471
Firmware Version: MF6OAA70
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Jul 19 15:35:25 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 316) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   138   138   054    Pre-fail  Offline      -       76
  3 Spin_Up_Time            0x0007   152   152   024    Pre-fail  Always       -       456 (Average 364)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   142   142   020    Pre-fail  Offline      -       25
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       22572
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       222
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       222
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 23/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22554         -
# 2  Short offline       Completed without error       00%     22542         -
# 3  Extended offline    Completed without error       00%      3393         -
# 4  Extended offline    Completed without error       00%      3316         -
# 5  Extended offline    Completed without error       00%        21         -
# 6  Extended offline    Completed without error       00%         4         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sdb:

Code: Select all

root@server [~]# smartctl -a /dev/sdb
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST2000NM0033-9ZM175
Serial Number:    Z1X0CRNR
LU WWN Device Id: 5 000c50 03cecc3c6
Firmware Version: SN03
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ACS-2 (revision not indicated)
Local Time is:    Wed Jul 19 15:34:43 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  592) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 254) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       67401833
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   094   060   030    Pre-fail  Always       -       3001257799
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22094
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   059   045    Old_age   Always       -       33 (Min/Max 31/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       946
194 Temperature_Celsius     0x0022   033   041   000    Old_age   Always       -       33 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   046   015   000    Old_age   Always       -       67401833
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22091         -
# 2  Short offline       Completed without error       00%     22064         -
# 3  Extended offline    Completed without error       00%      2914         -
# 4  Extended offline    Completed without error       00%      2838         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
PS: Currently re-testing /dev/sda

Wodan
ISO Advocate

19 Jul 2017, 17:54

Okay, I am no SMART readout expert, but this line from sdb worries me:

Code: Select all

  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       67401833
Are the HDDs in a RAID config?

It might make sense to request a replacement of sdb, rebuild the RAID, and then have sda replaced?
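
A side note on reading that raw number: a commonly repeated community interpretation (not confirmed anywhere in this thread, and not taken from vendor documentation) is that Seagate drives pack two counters into the 48-bit raw field, with the upper 16 bits holding an error count and the lower 32 bits holding an operation count. The sketch below only illustrates that assumed split on the value quoted above; `split_seagate_raw` is a made-up helper name.

```python
from typing import Tuple


def split_seagate_raw(raw: int) -> Tuple[int, int]:
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits).

    ASSUMPTION (unverified for this drive): upper 16 bits = error count,
    lower 32 bits = operation count.
    """
    upper = (raw >> 32) & 0xFFFF
    lower = raw & 0xFFFFFFFF
    return upper, lower


# Raw_Read_Error_Rate raw value from the sdb readout above.
upper, lower = split_seagate_raw(67401833)
print("upper 16 bits:", upper)
print("lower 32 bits:", lower)
```

If the assumption holds, a raw value below 2^32 would mean the upper counter is still zero. That is a hint at best, so it does not replace watching the normalized VALUE and the self-test log.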

matt3o
-[°_°]-

19 Jul 2017, 18:08

/dev/sdb seems to be deteriorating. If you run smartctl -a every 5-10 minutes, does the "VALUE" column drop over time?

webwit
Wild Duck

19 Jul 2017, 19:32

Code: Select all

root@server [~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      524224 blocks super 1.0 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      16777088 blocks super 1.0 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      1936208832 blocks super 1.0 [2/2] [UU]

unused devices: <none>
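
For anyone not used to reading /proc/mdstat: in the `[UU]` map each `U` is a member that is up, and a `_` would mark a missing or failed member. A minimal sketch of checking this automatically, using the output above as sample text; `degraded_arrays` is a hypothetical helper name, and on a live box you would read `/proc/mdstat` instead of the pasted string.

```python
import re

# Sample text copied from the /proc/mdstat output above.
MDSTAT_SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      524224 blocks super 1.0 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      16777088 blocks super 1.0 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      1936208832 blocks super 1.0 [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/2] [UU]": configured/active counts, member map.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current is not None:
            configured, active, member_map = status.groups()
            if "_" in member_map or configured != active:
                degraded.append(current)
            current = None
    return degraded


print(degraded_arrays(MDSTAT_SAMPLE))
```

An empty list means every array has all of its members, which matches the three `[2/2] [UU]` lines above. It says nothing about whether the two mirrors actually hold identical data.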

Code: Select all

root@server [~]# dmesg | grep raid
md: raid1 personality registered for level 1
md/raid1:md2: not clean -- starting background reconstruction
md/raid1:md2: active with 2 out of 2 mirrors
md/raid1:md1: active with 2 out of 2 mirrors
md/raid1:md0: active with 2 out of 2 mirrors

Code: Select all

root@server [~]# while true; do smartctl -a /dev/sdb |grep Raw_Read_Error_Rate; sleep 300; done
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       68872755
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69089054
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69227460
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69271452
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69324943
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69363286
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69486033
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69625038
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69693573
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69829568
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69901410
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       69950639
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       70037059
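
To put numbers on the loop output above: the normalized VALUE stays at 078 in all 13 samples, while the raw value keeps rising. A small sketch that computes the change between samples, assuming the samples are roughly 300 seconds apart (the `sleep 300` plus however long smartctl itself takes). The variable names are just for illustration.

```python
# Raw values copied from the 13 loop samples above.
RAW_SAMPLES = [
    68872755, 69089054, 69227460, 69271452, 69324943, 69363286, 69486033,
    69625038, 69693573, 69829568, 69901410, 69950639, 70037059,
]
INTERVAL_S = 300  # assumed spacing; the real gap is slightly longer

# Difference between each pair of consecutive samples.
deltas = [later - earlier for earlier, later in zip(RAW_SAMPLES, RAW_SAMPLES[1:])]
total_change = RAW_SAMPLES[-1] - RAW_SAMPLES[0]
elapsed_s = INTERVAL_S * (len(RAW_SAMPLES) - 1)

print("per-interval deltas:", deltas)
print("total change:", total_change)
print("approx. change per second:", round(total_change / elapsed_s, 1))
print("raw value rose in every interval:", all(d > 0 for d in deltas))
```

This only describes how fast the raw counter moves. Whether that movement means errors or just activity depends on how the drive encodes the raw field, which the output itself does not tell you.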

Wodan
ISO Advocate

19 Jul 2017, 19:38

Are you familiar with RAID management?

I'm not an expert myself! I guess I could get it to work somehow, but it'll be trial and error.

webwit
Wild Duck

19 Jul 2017, 19:40

I am not, it's just how the hoster set it up.

matt3o
-[°_°]-

19 Jul 2017, 20:10

webwit wrote: I am not, it's just how the hoster set it up.
If you got errors on one HDD, the RAID reconstruction is to be expected... and that's where RAID on just 2 drives is a little pointless, since the machine can't always tell which copy of the data is actually bad and which one is good (50-50).

A value of 78 in Raw_Read_Error_Rate is not terrible per se, but the raw value (70037059) is climbing quickly. I would proceed with the replacement of the HDD ASAP if they let you.
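
For context on why 78 is "not terrible": smartctl flags an attribute as failed once its normalized VALUE falls to or below THRESH, so the number to watch is the headroom between the two. A minimal sketch that parses one attribute row from the output above and prints that headroom; `parse_attribute_row` is a hypothetical helper, and the column positions are assumed to match the smartctl -a table shown in this thread.

```python
# Row copied from the last sample of the sdb loop output.
ROW = "  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       70037059"


def parse_attribute_row(row):
    """Parse one row of the 'Vendor Specific SMART Attributes' table."""
    # Ten columns; the last one (RAW_VALUE) may itself contain spaces.
    fields = row.strip().split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "raw": fields[9],
    }


attr = parse_attribute_row(ROW)
print(attr["name"], "type:", attr["type"])
print("current headroom (VALUE - THRESH):", attr["value"] - attr["thresh"])
print("lowest headroom so far (WORST - THRESH):", attr["worst"] - attr["thresh"])
```

The WORST column is worth a look too, since it shows how close the drive has already come to the threshold at some point in its life.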
