


8 GPU machine freezes


















We have a SuperMicro GPU server with:



  • 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz

  • 512GB memory

  • more than enough disk space

  • X10DRG-O+-CPU (BIOS Version : 2.0a [current])

  • X9DRG-O-PCIE PCI-E expander card

  • 8x GTX 1080

It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0.
When it runs, it runs fine for a while. It is, however, completely useless with the stock kernel (v4.4): the system freezes almost immediately when doing something non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (just a different maker of the GPUs) shows the exact same behaviour.



To make it run fine for some time, you have to downgrade the kernel to v3.14.1-trusty (we tested almost every version before that one). But there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes, other times only the GPU-related processes.



Other people [1] [2] seem to be having this issue, but there is no solution there.



Is anyone having the same experience with this type of machine?



Update:
The machines seem to run stably (regardless of the software) if the cards are inserted on only one side of the PCI-E expander, which means all cards are driven by the same CPU.
Another machine, however, seems to run stably with 8 cards (uptime of about 4 months right now) on kernel 3.19, after months of having the problems described above. Bizarre.
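
For anyone who wants to check which CPU each card hangs off without opening the chassis, here is a minimal sketch. It assumes a Linux system where sysfs exposes `vendor`, `class` and `numa_node` for each PCI device; on some platforms `numa_node` reads `-1`, meaning the kernel does not know. The function name `gpus_by_numa_node` is made up for illustration and is not from any tool mentioned in this thread.

```python
from collections import defaultdict
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpus_by_numa_node(sysfs_root="/sys/bus/pci/devices"):
    """Group NVIDIA display-class PCI devices by the NUMA node sysfs reports."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # no sysfs PCI tree available (e.g. not Linux)
    groups = defaultdict(list)
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries without readable attributes
        # Class codes starting with 0x03 are display controllers (VGA / 3D),
        # which filters out the HDMI audio function of each card.
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            groups[node].append(dev.name)
    return dict(groups)


if __name__ == "__main__":
    for node, devices in sorted(gpus_by_numa_node().items()):
        print(f"NUMA node {node}: {len(devices)} GPU(s): {', '.join(devices)}")
```

With cards on both sides of the expander you would expect two NUMA nodes to show up; with cards on one side only, a single node. This only reports topology and says nothing about why the split configuration freezes.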



[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/



[2] https://devtalk.nvidia.com/default/topic/959699/linux/nvidia-smi-periodically-crashes-system-on-ubuntu-16-04-lts/







ubuntu supermicro cuda nvidia














asked Feb 8 '17 at 11:51 by pks; edited Apr 13 '17 at 18:39 by pks












  • Does your PSU provide enough power?

    – Gerald Schneider
    Feb 8 '17 at 15:49











  • It has 4 1600W (2+2 redundancy) power supplies, so yeah I guess they should. See here supermicro.com/products/system/4U/4028/SYS-4028GR-TR.cfm

    – pks
    Feb 8 '17 at 18:42












  • We have the same problems with two machines, fresh Ubuntu 16.04 install, kernel 4.4.0-75. A SuperMicro GPU server: 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 128GB memory, board X10DRG-O+-CPU (BIOS version 2.0b), 8x NVIDIA GTX 1080. It seems that driver version 367.44 is a lot more stable than any newer version or beta version, but still far from perfect. We also see random freezes.

    – emjotde
    May 8 '17 at 10:20











  • It's not clear whether you're in testing or production. In your place, I'd try with four cards, two per CPU. I'd try swapping the failover PSUs with the online ones. I'd monitor power consumption and system/CPU/GPU temperatures. Then I'd come back to the community with more details.

    – Marco
    May 8 '17 at 10:39











  • We're also facing the same problem on five different machines that have a couple of similar configurations: Supermicro X10DRG-O+-CPU, BIOS 2.0a, 2x E5-2650 v4 @ 2.20GHz, kernel 4.4.0-91, 8x Nvidia GTX 1080, on the 384.66 driver. It seems we are not alone; I'd be interested to hear if anybody has found a solution to this problem.

    – David Bau
    Sep 4 '17 at 2:39



























2 Answers













I had the exact same issue on the same computer. To fix this, you will need to disable the on-board VGA by changing jumper JPG1 on the motherboard. Unfortunately, you'll need to remove the daughterboard to do so. Note that, to re-install the daughterboard, you may need to apply quite a bit of pressure for it to connect properly with the motherboard again.

















answered Jun 24 '17 at 6:23 by tinkerthinker












  • I was fighting this same issue on an identical system for more than a year. We tried this solution, and so far it appears to have resolved our issue! Thanks for posting this. You have truly saved us from a lot of trouble!

    – David Steinhauer
    Apr 11 '18 at 14:47











  • With CentOS 7.3, the way we were able to force the hangs (for troubleshooting) was by running a program which repeatedly queried the GPU temperatures, using NVML. This generally hung the server within a couple of hours. After the jumper change, the system has been operating for about 20 days with no hangs.

    – David Steinhauer
    Apr 11 '18 at 14:50
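
The reproduction method in the comment above (polling GPU temperatures through NVML until the server hangs) could look roughly like the sketch below. This is not the commenter's program. It assumes the `pynvml` bindings (the `nvidia-ml-py` package) are installed, and the helper names `poll_gpu_temperatures` and `make_nvml_reader` are hypothetical.

```python
import time


def poll_gpu_temperatures(read_temps, interval_s=1.0, iterations=None,
                          sleep=time.sleep):
    """Call read_temps() repeatedly; return the number of completed polls.

    With iterations=None the loop runs until interrupted (or the box hangs).
    """
    done = 0
    while iterations is None or done < iterations:
        temps = read_temps()
        # Flush so the last successful reading is visible if the machine freezes.
        print(time.strftime("%H:%M:%S"), temps, flush=True)
        done += 1
        sleep(interval_s)
    return done


def make_nvml_reader():
    """Return a function that reads every GPU's temperature (deg C) via NVML.

    Assumption: pynvml is installed and an NVIDIA driver is loaded.
    """
    import pynvml  # imported lazily so the loop above works without it

    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    def read_temps():
        return [pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
                for h in handles]

    return read_temps
```

On real hardware you would start it with `poll_gpu_temperatures(make_nvml_reader(), interval_s=0.1)` and redirect the output to a file on another host or a serial console, since local logs may be lost when the machine freezes. The polling interval is a guess; the comment does not say how often the original program queried.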
































There is a known issue with the PCI bus (power management) that seems to have been resolved by SuperMicro. We have just received a flashable BIOS+firmware update from them and are testing it.
I don't think I can share the update (unsure about licensing), so I would advise you to contact SuperMicro.















answered May 16 '17 at 6:59 by adev (edited May 22 '17 at 6:19)












  • Hi adev, any news about your GPU server?

    – lhlmgr
    Sep 28 '17 at 10:47
















