8 GPU machine freezes


We have a SuperMicro GPU server with:

  • 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
  • 512GB memory
  • more than enough disk space
  • X10DRG-O+-CPU (BIOS Version: 2.0a [current])
  • X9DRG-O-PCIE PCI-E expander card
  • 8x GTX 1080

It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0.
When it runs, it runs fine for a while. It is, however, completely useless with the stock kernel (v4.4): the system freezes almost immediately when doing anything non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (just a different maker of the GPUs) shows exactly the same behaviour.

To make it run fine for some time, you have to downgrade the kernel to v3.14.1-trusty (we tested almost every version before that one). But there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes, other times just the GPU-related processes.

There seem to be other [1] people [2] having this issue, but no solution there.

Is anyone having the same experience with this type of machine?

Update:
The machines seem to run stably (regardless of any software) if the cards are inserted on only one side of the PCI-E expander, which means all cards are driven by the same CPU.
Another machine, however, has been running stably with 8 cards (uptime of about 4 months right now) on kernel 3.19, after months of having the problems described above. Bizarre.
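
For anyone wanting to confirm which CPU drives which card, here is a minimal sketch that reads the standard Linux sysfs PCI attributes (`vendor`, `class`, `numa_node`) and lists the NUMA node of each NVIDIA display device. The function name `gpu_numa_nodes` and its `sysfs_root` parameter are made up for illustration, and the sketch assumes that NUMA node and CPU socket coincide on this board, which may not hold for every configuration.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map each NVIDIA display-class PCI address to its NUMA node.

    A node of -1 means the kernel reports no NUMA affinity for the device.
    `sysfs_root` is a hypothetical parameter so the function can be pointed
    at a different directory (for example in a test).
    """
    mapping = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return mapping
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this device
        # Class codes starting with 0x03 are display controllers; this
        # filters out the HDMI audio function that shares the vendor ID.
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            mapping[dev.name] = node
    return mapping
```

Calling `gpu_numa_nodes()` and printing the result gives one line of evidence about whether a stable configuration really has all cards on a single node.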



[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/
[2] https://devtalk.nvidia.com/default/topic/959699/linux/nvidia-smi-periodically-crashes-system-on-ubuntu-16-04-lts/
Tags: ubuntu supermicro cuda nvidia

asked Feb 8 '17 at 11:51 by pks (edited Apr 13 '17 at 18:39)

  • Does your PSU provide enough power?

    – Gerald Schneider
    Feb 8 '17 at 15:49

  • It has four 1600W power supplies (2+2 redundancy), so yes, I guess they should. See supermicro.com/products/system/4U/4028/SYS-4028GR-TR.cfm

    – pks
    Feb 8 '17 at 18:42

  • We have the same problems with two machines, fresh Ubuntu 16.04 install, kernel 4.4.0-75. A SuperMicro GPU server: 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz; 128GB memory; board X10DRG-O+-CPU (BIOS Version: 2.0b); 8x NVIDIA GTX 1080. Driver version 367.44 seems a lot more stable than any newer or beta version, but it is still far from perfect. We also see random freezes.

    – emjotde
    May 8 '17 at 10:20

  • It's not clear whether this is a test or a production system. In your place, I'd try running with four cards, two per CPU. I'd try swapping the failover PSUs with the online ones. I'd monitor power consumption and system/CPU/GPU temperatures. Then I'd come back to the community with more details.

    – Marco
    May 8 '17 at 10:39

  • We're also facing the same problem on five different machines with a couple of similar configurations: Supermicro X10DRG-O+-CPU, BIOS 2.0a, 2x E5-2650 v4 @ 2.20GHz, kernel 4.4.0-91, 8x Nvidia GTX 1080, on the 384.66 driver. It seems we are not alone. I'd be interested to hear whether anybody has found a solution to this problem.

    – David Bau
    Sep 4 '17 at 2:39
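
Since the reports above differ mainly in kernel and driver version, a small script that collects both makes them easier to compare across machines. This is only a sketch: the function names are hypothetical, and it assumes the NVIDIA kernel module exposes its version in `/proc/driver/nvidia/version` on a line containing "Kernel Module" followed by the version number, which is the usual layout but is not stated anywhere in this thread.

```python
import platform
import re
from pathlib import Path


def parse_driver_version(text):
    """Extract a version such as '367.57' from the NVRM version line.

    Returns None when no version-looking token follows 'Kernel Module'.
    """
    match = re.search(r"Kernel Module.*?\s(\d+(?:\.\d+)+)\s", text)
    return match.group(1) if match else None


def config_report(version_file="/proc/driver/nvidia/version"):
    """Return the running kernel release and the loaded NVIDIA driver version.

    `version_file` is a parameter only so the path can be overridden;
    the driver entry is None if the file is missing or unreadable.
    """
    try:
        text = Path(version_file).read_text()
    except OSError:
        text = ""
    return {
        "kernel": platform.release(),
        "nvidia_driver": parse_driver_version(text),
    }
```

Printing `config_report()` on each affected machine gives the two values most commenters quote by hand.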

















2 Answers
I had the exact same issue on the same computer. To fix this, you will need to disable the on-board VGA by changing jumper JPG1 on the motherboard. Unfortunately, you'll need to remove the daughterboard to do so. Note that, to re-install the daughterboard, you may need to apply quite a bit of pressure for it to connect properly with the motherboard again.






answered Jun 24 '17 at 6:23 by tinkerthinker

  • I was fighting this same issue on an identical system for more than a year. We tried this solution, and so far it appears to have resolved our issue! Thanks for posting this. You have truly saved us from a lot of trouble!

    – David Steinhauer
    Apr 11 '18 at 14:47

  • With CentOS 7.3, the way we were able to force the hangs (for troubleshooting) was by running a program which repeatedly queried the GPU temperatures, using NVML. This generally hung the server within a couple of hours. After the jumper change, the system has been operating for about 20 days with no hangs.

    – David Steinhauer
    Apr 11 '18 at 14:50
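
The program used in the comment above is not shown. A rough stand-in, sketched here under stated assumptions, polls temperatures through `nvidia-smi` rather than calling NVML directly (`nvidia-smi` is itself built on NVML). The function names, the `run` parameter and the timeout value are illustrative choices, not taken from the thread.

```python
import subprocess
import time

# nvidia-smi's CSV query mode: one line per GPU, e.g. "0, 45"
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn lines like '0, 45' into {0: 45}; unreadable values become None."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, value = (field.strip() for field in line.split(",", 1))
        try:
            temps[int(index)] = int(value)
        except ValueError:
            temps[int(index)] = None
    return temps


def poll(iterations, interval_s=0.1, timeout_s=30, run=subprocess.check_output):
    """Query all GPU temperatures `iterations` times and return the readings.

    `run` is a hypothetical hook so the command runner can be replaced.
    If a query stalls longer than `timeout_s`, subprocess.TimeoutExpired
    is raised, which is a useful signal that the driver has stopped responding.
    """
    readings = []
    for _ in range(iterations):
        output = run(QUERY, timeout=timeout_s).decode()
        readings.append(parse_temperatures(output))
        time.sleep(interval_s)
    return readings
```

Running something like `poll(100000)` on an affected machine, and again after the jumper change, would be one way to compare before and after; whether it triggers the hang as reliably as the original NVML program is untested.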

















There is a known issue with the PCI bus (power management) that seems to have been resolved by SuperMicro. We have just received a flashable BIOS+firmware update from them and are testing it.
I don't think I can share the update (I'm unsure about licensing), so I would advise you to contact SuperMicro.

answered May 16 '17 at 6:59 by adev (edited May 22 '17 at 6:19)

  • Hi adev, any news about your GPU server?

    – lhlmgr
    Sep 28 '17 at 10:47
















