8 GPU machine freezes


We have a SuperMicro GPU server with:

2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz

512GB memory

more than enough disk space

X10DRG-O+-CPU (BIOS Version : 2.0a [current])

X9DRG-O-PCIE PCI-E expander card

8x GTX 1080

It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0.
When it runs, it runs fine for a while. It is, however, completely unusable with the stock kernel (v4.4): the system freezes almost immediately when doing anything non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (just a different GPU manufacturer) shows exactly the same behaviour.

To make it run fine for some time, you have to downgrade the kernel to v3.14.1-trusty (tested almost every version before that one). But there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes; other times only the GPU-related processes do.

There seem to be other [1] people [2] having this issue, but no solution there.

Is anyone having the same experience with this type of machine?

Update:
The machines seem to run stably (regardless of any software) if the cards are inserted on only one side of the PCI-E expander, which means all cards are driven by the same CPU.
Another machine, however, seems to run stably with 8 cards (uptime of about 4 months right now) on kernel 3.19, after months of having the problems described above. Bizarre.
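To check which CPU socket each card actually hangs off, one option is to read the NUMA node that Linux reports for each PCI device in sysfs. The following is a minimal sketch, not something from the original post: it assumes a Linux host, that NVIDIA GPUs show up with PCI vendor ID 0x10de and a display-class code (0x03...), and the function name `gpus_by_numa_node` is made up for illustration. A reported node of -1 means the kernel does not know the affinity.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpus_by_numa_node(sysfs_root="/sys/bus/pci/devices"):
    """Group NVIDIA display-class PCI devices by the NUMA node sysfs reports."""
    groups = {}
    for dev in sorted(Path(sysfs_root).iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries that lack these attributes
        # Class codes starting with 0x03 are display controllers (VGA / 3D),
        # which filters out the HDMI audio function on the same card.
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            groups.setdefault(node, []).append(dev.name)
    return groups


if __name__ == "__main__":
    for node, addrs in sorted(gpus_by_numa_node().items()):
        print(f"NUMA node {node}: {len(addrs)} GPU(s): {', '.join(addrs)}")
```

On a machine where the cards sit on one side of the expander only, all PCI addresses would be expected under a single node; with cards on both sides they should be split across two.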

[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/

[2] https://devtalk.nvidia.com/default/topic/959699/linux/nvidia-smi-periodically-crashes-system-on-ubuntu-16-04-lts/

edited Apr 13 '17 at 18:39

asked Feb 8 '17 at 11:51

pks

163

Does your PSU provide enough power?

– Gerald Schneider
Feb 8 '17 at 15:49

It has four 1600W power supplies (2+2 redundancy), so yes, I guess it should. See supermicro.com/products/system/4U/4028/SYS-4028GR-TR.cfm

– pks
Feb 8 '17 at 18:42

We have the same problems with two machines: fresh Ubuntu 16.04 install, kernel 4.4.0-75. A SuperMicro GPU server with 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 128GB memory, board X10DRG-O+-CPU (BIOS version 2.0b) and 8x NVIDIA GTX 1080. It seems that driver version 367.44 is a lot more stable than any newer version or beta version, but still far from perfect. We also see random freezes.

– emjotde
May 8 '17 at 10:20

It's not clear whether this is a test or a production system. In your place, I'd try with four cards, two per CPU. I'd try swapping the failover PSUs with the online ones. I'd monitor power consumption and system/CPU/GPU temperatures, and then come back to the community with more details.

– Marco
May 8 '17 at 10:39

We're also facing the same problem with five different machines that have a couple of similar configurations: Supermicro X10DRG-O+-CPU, BIOS 2.0a, 2x E5-2650 v4 @ 2.20GHz, kernel 4.4.0-91, with 8x NVIDIA GTX 1080, on the 384.66 driver. It seems we are not alone; I am interested to hear if anybody has found a solution to this problem.

– David Bau
Sep 4 '17 at 2:39


ubuntu supermicro cuda nvidia

2 Answers

I had the exact same issue on the same computer. To fix this, you will need to disable the on-board VGA by changing jumper JPG1 on the motherboard. Unfortunately, you'll need to remove the daughterboard to do so. Note that, to re-install the daughterboard, you may need to apply quite a bit of pressure for it to connect properly with the motherboard again.

answered Jun 24 '17 at 6:23

tinkerthinker

212

I was fighting this same issue on an identical system for more than a year. We tried this solution, and so far it appears to have resolved our issue! Thanks for posting this. You have truly saved us from a lot of trouble!

– David Steinhauer
Apr 11 '18 at 14:47

With CentOS 7.3, the way we were able to force the hangs (for troubleshooting) was by running a program which repeatedly queried the GPU temperatures, using NVML. This generally hung the server within a couple of hours. After the jumper change, the system has been operating for about 20 days with no hangs.

– David Steinhauer
Apr 11 '18 at 14:50
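A reproduction loop of the kind described in the comment above could look roughly like the sketch below. It is not the commenter's program: instead of calling NVML directly it shells out to `nvidia-smi` (which is itself built on NVML), assumes `nvidia-smi` is on the PATH, and the names `parse_temperatures` and `poll_temperatures` as well as the one-second interval are illustrative choices.

```python
import subprocess
import time

# Ask nvidia-smi for one "index, temperature" CSV line per GPU.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: degrees_celsius}."""
    temps = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, temp = (field.strip() for field in line.split(","))
        # Non-numeric values such as "[N/A]" are stored as None.
        temps[int(index)] = int(temp) if temp.isdigit() else None
    return temps


def poll_temperatures(interval_s=1.0, iterations=None):
    """Query all GPUs repeatedly; loops forever when iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        # A call that fails, times out, or never returns is the symptom
        # of interest when trying to provoke the hang.
        result = subprocess.run(
            QUERY_CMD, capture_output=True, text=True, timeout=30, check=True
        )
        print(time.strftime("%H:%M:%S"), parse_temperatures(result.stdout), flush=True)
        time.sleep(interval_s)
        count += 1


if __name__ == "__main__":
    poll_temperatures()
```

Printing a timestamp with every reading means the last line on the console (or in a redirected log) shows roughly when the machine stopped responding.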


There is a known issue with the PCI bus (power management) that SuperMicro seems to have resolved. We have just received a flashable BIOS+firmware update from them and are testing it.
I don't think I can share the update (unsure about licensing), so I would advise you to contact SuperMicro.

edited May 22 '17 at 6:19

answered May 16 '17 at 6:59

adev

Hi adev, any news about your GPU server?

– lhlmgr
Sep 28 '17 at 10:47
