8 GPU machine freezes
We have a SuperMicro GPU server with:
- 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
- 512GB memory
- more than enough disk space
- X10DRG-O+-CPU (BIOS Version : 2.0a [current])
- X9DRG-O-PCIE PCI-E expander card
- 8x GTX 1080
It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0.
When it runs, it runs fine for a while. It is, however, completely useless with the stock kernel (v4.4): the system freezes almost immediately when doing anything non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (just a different maker of the GPUs) shows exactly the same behaviour.
To make it run fine for some time, you have to downgrade the kernel to v3.14.1-trusty (tested almost every version before that one). But there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes, other times only the GPU-related processes.
There seem to be other [1] people [2] having this issue, but no solution there.
Is anyone else seeing the same behaviour with this type of machine?
Update:
The machines seem to run stably (regardless of the software) if the cards are inserted on only one side of the PCI-E expander, which means all cards are driven by the same CPU.
Another machine, however, has been running stably with 8 cards (uptime of about 4 months right now) on kernel 3.19, after months of the problems described above. Bizarre.
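To see which CPU each card is attached to (the one-sided configuration in the update), the NUMA node that Linux reports for each PCI device in sysfs can be grouped per GPU. This is only a minimal sketch: the function name is made up for illustration, and it assumes the kernel exposes `vendor`, `class` and `numa_node` under `/sys/bus/pci/devices`. A reported node of -1 means the kernel has no locality information for that device.

```python
from collections import defaultdict
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpus_by_numa_node(sysfs_root="/sys/bus/pci/devices"):
    """Group NVIDIA display devices by the NUMA node reported in sysfs.

    Illustrative helper, not part of any NVIDIA or SuperMicro tooling.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # not on Linux, or sysfs is not mounted
    groups = defaultdict(list)
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip devices with missing or unreadable attributes
        # Class 0x03xxxx is "display controller"; this skips the cards'
        # HDMI audio functions, which share the same vendor ID.
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            groups[node].append(dev.name)
    return dict(groups)


if __name__ == "__main__":
    for node, devices in sorted(gpus_by_numa_node().items()):
        print(f"NUMA node {node}: {len(devices)} GPU(s): {', '.join(devices)}")
```

On a dual-socket board with all eight cards installed, one would expect the GPUs to be split across two nodes; with only one side of the expander populated they should all land on the same node.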
[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/
[2] https://devtalk.nvidia.com/default/topic/959699/linux/nvidia-smi-periodically-crashes-system-on-ubuntu-16-04-lts/
ubuntu supermicro cuda nvidia
edited Apr 13 '17 at 18:39
asked Feb 8 '17 at 11:51 by pks
Does your PSU provide enough power?
– Gerald Schneider
Feb 8 '17 at 15:49
It has four 1600W power supplies (2+2 redundancy), so yeah, I guess it should. See here supermicro.com/products/system/4U/4028/SYS-4028GR-TR.cfm
– pks
Feb 8 '17 at 18:42
We have the same problems with two machines, fresh Ubuntu 16.04 install, kernel 4.4.0-75. A SuperMicro GPU server: 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 128GB memory, board X10DRG-O+-CPU (BIOS version 2.0b), 8x NVIDIA GTX 1080. It seems that driver version 367.44 is a lot more stable than any newer or beta version, but still far from perfect. We also see random freezes.
– emjotde
May 8 '17 at 10:20
It's not clear whether this is a test or a production system. In your place, I'd try with four cards, two per CPU. I'd try swapping the failover PSUs with the online ones. I'd monitor power consumption and system/CPU/GPU temperatures. Then I'd come back to the community with more details.
– Marco
May 8 '17 at 10:39
We're also facing the same problem on five different machines with similar configurations: Supermicro X10DRG-O+-CPU, BIOS 2.0a, 2x E5-2650 v4 @ 2.20GHz, kernel 4.4.0-91, 8x Nvidia GTX 1080, on the 384.66 driver. It seems we are not alone; I'd be interested to hear whether anybody has found a solution to this problem.
– David Bau
Sep 4 '17 at 2:39
2 Answers
I had the exact same issue on the same computer. To fix this, you will need to disable the on-board VGA by changing jumper JPG1 on the motherboard. Unfortunately, you'll need to remove the daughterboard to do so. Note that, to re-install the daughterboard, you may need to apply quite a bit of pressure for it to connect properly with the motherboard again.
answered Jun 24 '17 at 6:23 by tinkerthinker
I was fighting this same issue on an identical system for more than a year. We tried this solution, and so far it appears to have resolved our issue! Thanks for posting this. You have truly saved us from a lot of trouble!
– David Steinhauer
Apr 11 '18 at 14:47
With CentOS 7.3, the way we were able to force the hangs (for troubleshooting) was by running a program which repeatedly queried the GPU temperatures, using NVML. This generally hung the server within a couple of hours. After the jumper change, the system has been operating for about 20 days with no hangs.
– David Steinhauer
Apr 11 '18 at 14:50
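The reproduction described in the comment above (polling GPU temperatures until the server hangs) can be approximated as follows. The commenter's actual program is not shown in this thread; this sketch swaps in the `nvidia-smi` command-line tool instead of calling NVML directly, and the function names, the one-second interval and the 30-second timeout are arbitrary choices for illustration.

```python
import subprocess
import time

# nvidia-smi query for one "index, temperature" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn 'index, temperature' CSV lines into {gpu_index: degrees_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def poll(interval_s=1.0, iterations=None, run=subprocess.check_output):
    """Query all GPUs repeatedly, printing one timestamped line per poll.

    With iterations=None the loop runs until interrupted. `run` is only a
    parameter so the loop can be exercised without a GPU present.
    """
    count = 0
    while iterations is None or count < iterations:
        out = run(QUERY, timeout=30).decode()
        # flush so the last successful poll is visible if the machine hangs
        print(time.strftime("%H:%M:%S"), parse_temperatures(out), flush=True)
        count += 1
        time.sleep(interval_s)
```

Calling `poll()` with no arguments runs until interrupted. Redirecting the output to a file on another machine (for example over SSH) keeps the timestamp of the last successful query if the server freezes. Note that a wedged driver may leave `nvidia-smi` in a state where the timeout never fires, so the absence of new lines is itself the signal.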
There is a known issue with the PCI bus (power management) that SuperMicro seems to have resolved. We have just received a flashable BIOS+firmware update from them and are testing it.
I don't think I can share the update (unsure about licensing), so I would advise you to contact SuperMicro.
edited May 22 '17 at 6:19
answered May 16 '17 at 6:59 by adev
Hi adev, any news about your GPU server?
– lhlmgr
Sep 28 '17 at 10:47