8 GPU machine freezes


We have a SuperMicro GPU server with:

  • 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
  • 512GB memory
  • more than enough disk space
  • X10DRG-O+-CPU (BIOS version: 2.0a [current])
  • X9DRG-O-PCIE PCI-E expander card
  • 8x GTX 1080

It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0.
When it runs, it runs fine for a while. It is, however, completely useless with the stock kernel (v4.4): the system freezes almost immediately when doing anything non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (only the GPU manufacturer differs) shows exactly the same behaviour.

To make it run fine for some time, you have to downgrade the kernel to v3.14.1-trusty (tested almost every version before that one). But there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes, other times only the GPU-related processes.

Other people seem to be having this issue [1] [2], but no solution has been posted there.

Is anyone having the same experience with this type of machine?

Update:
The machines seem to run stably (regardless of software) if the cards are inserted on only one side of the PCI-E expander, which means all cards are driven by the same CPU.
Another machine, however, seems to run stably with 8 cards (uptime of about 4 months right now) on kernel 3.19, after months of having the problems described above. Bizarre.

[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/

[2] https://devtalk.nvidia.com/default/topic/959699/linux/nvidia-smi-periodically-crashes-system-on-ubuntu-16-04-lts/
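
Since the update points at which CPU drives the cards, it may help to check how the GPUs are spread across NUMA nodes. The following is a minimal sketch and not something from the original post: it assumes a Linux sysfs layout under /sys/bus/pci/devices, where each device directory exposes `vendor`, `class` and `numa_node` files, and it uses NVIDIA's PCI vendor ID 0x10de. The function names are made up for illustration.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: numa_node} for NVIDIA display-class devices."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable; skip this device
        # Class 0x03xxxx is "display controller" (VGA, 3D controller, ...).
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            result[dev.name] = node
    return result


def group_by_node(mapping):
    """Invert {address: node} into {node: [addresses]}."""
    groups = {}
    for address, node in mapping.items():
        groups.setdefault(node, []).append(address)
    return groups
```

Calling `group_by_node(gpu_numa_nodes())` on the server should show whether all eight cards sit behind one socket or are split across both. A `numa_node` of -1 means the kernel did not report a node for that device.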







ubuntu supermicro cuda nvidia






edited Apr 13 '17 at 18:39 by pks
asked Feb 8 '17 at 11:51 by pks












  • Does your PSU provide enough power?

    – Gerald Schneider
    Feb 8 '17 at 15:49

  • It has four 1600W power supplies (2+2 redundancy), so yes, I guess they should. See supermicro.com/products/system/4U/4028/SYS-4028GR-TR.cfm

    – pks
    Feb 8 '17 at 18:42

  • We have the same problems with two machines, fresh Ubuntu 16.04 install, kernel 4.4.0-75. A SuperMicro GPU server: 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 128GB memory, board X10DRG-O+-CPU (BIOS version: 2.0b), 8x NVIDIA GTX 1080. Driver version 367.44 seems a lot more stable than any newer or beta version, but it is still far from perfect. We also see random freezes.

    – emjotde
    May 8 '17 at 10:20

  • It's not clear whether this is a test or a production system. In your place, I'd try four cards, two per CPU. I'd try swapping the failover PSUs with the online ones. I'd monitor power consumption and system/CPU/GPU temperatures. Then I'd come back to the community with more details.

    – Marco
    May 8 '17 at 10:39

  • We're also facing the same problem with five different machines that have a couple of similar configurations: Supermicro X10DRG-O+-CPU, BIOS 2.0a, 2x E5-2650 v4 @ 2.20GHz, kernel 4.4.0-91, 8x Nvidia GTX 1080, on the 384.66 driver. It seems we are not alone. I'd be interested to hear whether anybody has found a solution to this problem.

    – David Bau
    Sep 4 '17 at 2:39

















2 Answers
I had the exact same issue on the same computer. To fix this, you will need to disable the on-board VGA by changing jumper JPG1 on the motherboard. Unfortunately, you'll need to remove the daughterboard to do so. Note that, to re-install the daughterboard, you may need to apply quite a bit of pressure for it to connect properly with the motherboard again.
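
To confirm after the jumper change that the on-board VGA is really gone, one option is to list the VGA-class PCI devices the kernel still sees. This is a rough sketch under assumptions: it reads the Linux sysfs attributes `class`, `vendor` and `boot_vga`, treats vendor ID 0x10de as NVIDIA, and the function names are hypothetical. It does not touch the jumper or the BIOS in any way.

```python
from pathlib import Path


def vga_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return [(pci_address, vendor_id, is_boot_vga)] for VGA-class devices."""
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            pci_class = (dev / "class").read_text().strip()
            vendor = (dev / "vendor").read_text().strip()
        except OSError:
            continue  # attribute missing or unreadable; skip this device
        if not pci_class.startswith("0x0300"):  # 0x0300xx = VGA-compatible controller
            continue
        boot_file = dev / "boot_vga"
        is_boot = boot_file.is_file() and boot_file.read_text().strip() == "1"
        found.append((dev.name, vendor, is_boot))
    return found


def non_nvidia_vga(devices, nvidia_vendor="0x10de"):
    """Keep only the VGA devices whose vendor ID is not NVIDIA's."""
    return [d for d in devices if d[1] != nvidia_vendor]
```

If `non_nvidia_vga(vga_devices())` still returns an entry after the jumper has been moved, the on-board adapter is presumably still enabled.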







answered Jun 24 '17 at 6:23 by tinkerthinker












  • I was fighting this same issue on an identical system for more than a year. We tried this solution, and so far it appears to have resolved our issue! Thanks for posting this. You have truly saved us from a lot of trouble!

    – David Steinhauer
    Apr 11 '18 at 14:47

  • With CentOS 7.3, we were able to force the hangs (for troubleshooting) by running a program that repeatedly queried the GPU temperatures using NVML. This generally hung the server within a couple of hours. After the jumper change, the system has been operating for about 20 days with no hangs.

    – David Steinhauer
    Apr 11 '18 at 14:50
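
A reproducer along the lines described in this comment could look like the sketch below. The commenter's program used NVML directly and is not shown here; this version instead shells out to `nvidia-smi` (which is itself built on NVML) so that it needs only the Python standard library. The function names, the polling interval and the timeout are assumptions.

```python
import subprocess
import time

# nvidia-smi query that prints one temperature (degrees C) per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"]


def read_temperatures(run=subprocess.run):
    """Return a list of GPU temperatures in degrees C, one entry per GPU."""
    done = run(QUERY, capture_output=True, text=True, timeout=30, check=True)
    return [int(line) for line in done.stdout.splitlines() if line.strip()]


def poll(iterations, interval_s=0.1, reader=read_temperatures, log=print):
    """Query temperatures repeatedly; return the number of successful reads."""
    ok = 0
    for i in range(iterations):
        stamp = time.strftime("%H:%M:%S")
        try:
            temps = reader()
        except (OSError, ValueError, subprocess.SubprocessError) as exc:
            # A timeout or a failed call is logged; on an affected machine
            # this may be the last line written before the hang.
            log(f"{stamp} query {i} failed: {exc!r}")
            continue
        ok += 1
        log(f"{stamp} query {i}: {temps}")
        time.sleep(interval_s)
    return ok
```

For a long-running attempt one might call `poll(iterations=100_000, interval_s=0.1)` with the output redirected to a file on disk, then look at the last logged query after a freeze.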

















There is a known issue with the PCI bus (power management) that SuperMicro seems to have resolved. We have just received a flashable BIOS and firmware update from them and are testing it.
I don't think I can share the update (I am unsure about the licensing), so I would advise you to contact SuperMicro.







edited May 22 '17 at 6:19
answered May 16 '17 at 6:59 by adev












  • Hi adev, any news about your GPU server?

    – lhlmgr
    Sep 28 '17 at 10:47
















