
Is it safe to keep the GPU on 100% utilization for a very long time?


I am currently performing number-crunching using CUDA on my GPU, an NVIDIA GeForce GTX 1050 Ti. These operations often take months to complete, and during that time I leave my PC on 24/7.



Is doing so safe? Am I risking potential overheating of my graphics card that might, in the worst case, result in a house fire?

Note that the PC is properly ventilated and there is no obstruction to its airflow.










hardware-failure gpu cuda

asked May 6 at 9:04 by Klangen
Comments:

  • With proper cooling, the only issue will be electricity bills. – montonero, May 6 at 9:14

  • The biggest question is: what level of risk is acceptable to you? You have to consider cases where other system components (e.g. one of the fans) fail while you're unavailable to monitor them. For systems where the tolerated risk is extremely low, companies might spend much more on duplicate cooling systems, automated fire-suppression devices, and monitoring systems. – jpaugh, May 6 at 15:09

  • Also, it might be cheaper to use someone else's hardware (maybe a local university?) than to do it on your own hardware, especially once you factor in your comfort level with the amount of risk involved. – jpaugh, May 6 at 15:22

  • Anecdotally, over the last decade I've run 8 graphics cards at high 24/7 compute loads for at least 3 years each. During that time I've had a single fan failure (AMD 5850) after ~2 years and a single card failure (NVidia 560) after 4(?) years. – Dan Neely, May 6 at 17:52

  • @EricDuminil Is the heat expended by a chip not proportional to the % utilisation? – Iain, May 6 at 23:36

6 Answers


Answer (score 56) – Eugen Rieck, answered May 6 at 9:14

Short answer: This should be safe on well-designed hardware.



Long answer:
The GPU (and its software environment: drivers, OS, daemons) is designed to protect against overheating. The GPU will first spin its fans up to a higher RPM; if that can't keep the temperature at a safe level, the GPU throttles the workload (usually by reducing the clock frequency). This ensures a heat profile that will not damage the GPU, and therefore neither the PC nor the room.



Caveat: There exist cheap knock-off graphics cards whose firmware is specifically designed to sacrifice safety for performance. While I don't think those exist for a 1050, I am not 100% sure. You should also prefer the Nvidia drivers downloaded from their website over "optimized" vendor drivers, which might do the same thing.
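
To sanity-check that this protection is actually working during a months-long run, it is easy to log the card's vitals from userspace. Below is a minimal monitoring sketch (Python); it assumes the standard nvidia-smi CLI that ships with the NVIDIA driver is on the PATH, and the queried fields, interval and log file name are just illustrative choices, not part of the answer above.

    #!/usr/bin/env python3
    """Minimal GPU health logger (sketch).

    Assumes the standard `nvidia-smi` tool from the NVIDIA driver is installed;
    adjust the queried fields, interval and log path to taste.
    """
    import csv
    import subprocess
    import time

    FIELDS = "timestamp,temperature.gpu,utilization.gpu,power.draw,clocks.sm"

    def sample():
        # nvidia-smi prints one CSV line per GPU with the requested fields.
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]

    def main(interval_s=60.0, logfile="gpu_health.csv"):
        with open(logfile, "a", newline="") as f:
            writer = csv.writer(f)
            while True:
                for line in sample():
                    writer.writerow([field.strip() for field in line.split(",")])
                f.flush()
                time.sleep(interval_s)

    if __name__ == "__main__":
        main()

If the logged temperature plateaus at a sane value while utilization stays near 100%, the fan-control/throttling path described above is doing its job.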






Comments:

  • It's not just "cheap knock-offs". I've seen six(!) completely independent GeForce 7600GS's from a reputable manufacturer die in the same way due to a presumably inadequate cooling design. This was a fanless "super silent" card used for office work or at most light gaming. However, high-end parts will likely be designed to cope with greater thermal abuse, although likely not for 24/7 loads. – TooTea, May 6 at 9:42

  • @Klangen The PSU differs from the GPU in that it is usually (apart from servers with a BMC) not actively monitored for temperature. That said, PSUs are designed to "fail safe", i.e. if they fail, they fail in a way that does not create additional damage. – Eugen Rieck, May 6 at 10:02

  • Because crypto-mining is the canonical example of sacrificing safety for performance. – Eugen Rieck, May 6 at 10:13

  • Anecdotal evidence: On my older desktop computer, I managed to fry both available CPU fan power ports (too much dust), so I decided to see if I could run the machine without a CPU fan by keeping a close eye on the CPU temperature. It hit 90 °C in a couple of minutes, then slowed down significantly. It was a Pentium, I believe. – John Dvorak, May 6 at 10:13

  • @JohnDvorak Yes, all non-ancient CPUs employ a similar method. – Eugen Rieck, May 6 at 10:16

Answer (score 9) – TooTea, answered May 6 at 9:55

A house fire is extremely unlikely, but the lifespan of the card may be reduced.



Long-term overheating of the GPU chip probably won't start a fire. The chip may deteriorate and start misbehaving or die completely, but silicon chips aren't very flammable. Bad things usually happen when electrolytic capacitors fail and blow up, but those won't overheat just because the card is doing a lot of crunching, and you hopefully have a metal PC case to contain the hot shrapnel from such a failure anyway.



However, consumer-grade parts aren't in general designed for long-term 24/7 loads, so it is fairly likely that the card will die sooner than if it weren't subject to such loads. It is hard to say how much sooner without more statistics on the given model. Some people in the HPC community advocate using high-end gaming GPUs instead of special HPC compute parts, and there seems to be some economic sense in that: although the commodity parts die in a year or so, it's cheaper to keep replacing them because they're many times cheaper than the alternative.
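
To make the replacement-economics argument concrete, here is a tiny back-of-the-envelope sketch; the prices and lifetimes are purely hypothetical placeholders for illustration, not figures from the answer or from any vendor.

    # Hypothetical numbers for illustration only -- not real prices or lifetimes.
    consumer = {"price": 300.0, "expected_years": 1.5}     # e.g. a gaming card
    datacenter = {"price": 6000.0, "expected_years": 5.0}  # e.g. a dedicated compute card

    def cost_per_year(card):
        return card["price"] / card["expected_years"]

    print(f"consumer card:   ~{cost_per_year(consumer):.0f} per year of 24/7 compute")
    print(f"datacenter card: ~{cost_per_year(datacenter):.0f} per year of 24/7 compute")
    # Even if the cheap card dies every year or two, its cost per year of compute
    # can stay well below the datacenter part (ignoring downtime, licensing and
    # per-card performance, which the comments below discuss).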






Comments:

  • Mechanical stress is the worst when heating up and cooling down, rather than running at a constant temperature. What the OP is planning to do is no worse than playing GPU-intensive games every day for a few months. – Dmitry Grigoryev, May 6 at 11:55

  • Unfortunately, Nvidia's license prevents an HPC data center from using consumer-grade gaming GPUs in their servers. We're required to use higher-end GPUs, and I currently have an order in for P100s when the researchers would actually prefer 1080 Ti cards. – doneal24, May 6 at 16:11

  • @technical_difficulty The EULA for the nVidia driver says that it doesn't apply to datacenter HPC use (that's certainly a concern for large centers, but it doesn't stop people building in-house HPC clusters from consumer parts). There's a decent writeup here: microway.com/knowledge-center-articles/… – TooTea, May 6 at 19:28

  • @user912264 Yes, that's what I meant, although presumably with a poor choice of words on my end. I'm by no means a lawyer or a licensing expert. My point is that you're not allowed to use the driver in such a situation (because you can't rely on the normal free license). – TooTea, May 6 at 20:06

  • Also note that you get more wear on the GPU fans from keeping them at a high speed. – user71659, May 6 at 21:54

Answer (score 6)

Yes, the card is likely to wear out sooner if it is under constant load. At small geometries, electromigration is a significant source of device failures, and devices are typically designed with a specific target lifetime in mind. This might be generous for typical operation (e.g. 5 years of continuous operation), but might not assume 100% of the maximum operating point for all of that time. As soon as you start overclocking, you can expect that target to drop significantly. (Equally, running at only 80% load would maybe double the lifetime due to this failure mechanism.)



There are of course other failures related to running components hot, or to thermal cycling; this is just to point out that modern electronics (and even 1980s electronics, when badly designed) can be susceptible to 'wearing out'.
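
A common first-order model for the electromigration lifetime mentioned above is Black's equation, MTTF ∝ J^(-n) · exp(Ea / kT). The sketch below just evaluates that relative scaling for assumed, illustrative parameter values (n, Ea, relative current density, junction temperature); it is not a lifetime prediction for any particular card.

    import math

    K_B = 8.617e-5  # Boltzmann constant in eV/K

    def relative_mttf(j_rel, temp_c, n=2.0, ea_ev=0.7, ref_temp_c=80.0):
        """Black's equation, expressed relative to a reference operating point.

        j_rel: current density relative to the reference (rough proxy for load/clock/voltage)
        temp_c: junction temperature in Celsius
        n, ea_ev: assumed model parameters (typical textbook values, not vendor data)
        """
        t = temp_c + 273.15
        t_ref = ref_temp_c + 273.15
        return j_rel ** (-n) * math.exp((ea_ev / K_B) * (1.0 / t - 1.0 / t_ref))

    # Relative lifetime at a few hypothetical operating points:
    print(relative_mttf(1.0, 80.0))  # reference point -> 1.0
    print(relative_mttf(0.8, 70.0))  # lighter load, cooler chip -> longer life
    print(relative_mttf(1.2, 90.0))  # overclocked and hotter -> shorter life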






Comments:

  • But would it wear out more doing the same workload over a shorter period of time? In other words, would running at 100% for some period be worse than running it at an average of 50% for double that time? – trlkly, May 8 at 4:34

  • Yes, exactly. Hotter, or higher voltages, mean higher fatigue for the same 'work'. To a first-order approximation, for this effect. – Sean Houlihane, May 8 at 6:19

Answer (score 3)

If your cooling system works OK, and your hardware is of any kind of even vaguely modern design that includes on-chip temperature monitoring and thermal throttling/suspend/shutdown, then it's entirely safe. It can't overheat so long as the cooler keeps running, and if that fails, the chips will throttle back until they're no longer producing more heat than can be passively dissipated (which may mean having to suspend completely, appearing like a hang/crash).



Worst case, if the throttling doesn't kick in fast and hard enough to compensate for accumulated thermal load, some part of the chip may end up melting or burning out, and you'll end up with a dead board, but by that point the throttling circuitry should have slammed into complete emergency shutdown, maybe even tripping a (temporary or permanent) fuse on the power rail, preventing any kind of runaway dumping of the entire input voltage randomly across the die and an actual fire.



Thankfully, the PC platform worked out most of the kinks in that kind of thermal protection 10-15 years ago, after the minor scandal of some mid-generation PIIIs and Athlons proving entirely capable of smoking themselves (and thus being a fire risk) if the cooler failed or fell off while the CPU was running at full tilt. One generation of chips later, it could easily be demonstrated that an overclocked high-end processor barely exceeded its maximum rated temperature at the heat-spreader surface if you tore the heatsink and fan off right in the middle of a heavy benchmark. The computer slowed to a crawl or even suffered a "fatal" crash (fatal to the software, that is; the hardware just needed the heatsink/fan replaced and a reboot), but the chips survived and no risk arose. Hopefully any GPU maker worth their salt isn't a decade and a half behind the curve, especially when their products already run pretty close to their rated temperature limits.



However, that doesn't make this kind of treatment entirely "safe" for the transistors on the chip. Heavyweight "number crunching" (Bitcoin? Protein folding?) using GPUs is by now a rather infamous way of literally wearing out the silicon. The combination of high voltage and current, continual switching billions of times per second, plus sustained high temperatures stress the components quite a bit, both the chips and the support parts like capacitors, so their operating lifetime can be reduced to barely two years in some cases, at least at full speed. They can then run on a bit longer if derated (maximum clock speed limited etc) and employed for less demanding purposes, like last year's games, but are on borrowed time once they start erroring out at maximum speed.



So it's not going to catch on fire, but I wouldn't bank on the card still being reliable past its third birthday in that employment...






Comments:

  • Crypto mining especially tends to operate multiple cards packed onto one mobo with inadequate airflow, resulting in high temperatures. Using one card in a good tower case with proper airflow should be significantly less stressful, although there is wear and tear on the fans. And as Sean's and your answers point out, electromigration from being powered up to full voltage can still be a concern even if temperatures are kept in check. – Peter Cordes, May 7 at 20:34

Answer (score 1)

As you mention, ventilation is good, so there is no need to worry about that risk factor.



As for the GPU itself: it will wear more under 100% load 24/7/365 than under ordinary office use of 8-16 hours a day, so it is unlikely to keep working for 5-10 years or more. You must also consider that a card can have a poorly designed cooler (on the card itself, not just the PC overall), a bad overall design, software and firmware bugs, or poor production quality and defects of varying severity and frequency, from single-unit defects to widespread ones. These factors can worsen heating, cause system failures, shorten the lifetime, cause short circuits, or in the worst case start a fire or give you an electric shock. Some of them depend on the model and revision, some are gradually fixed by software/firmware updates, and some vary from one unit to the next. Prefer models with a proven reliability record and a proper revision (usually the latest available). A bad card can also interfere with other components, for example by generating extra electrical noise. Also, do not forget that thermal paste gradually loses its qualities and makes cooling worse.



The graphics card is not the only component to consider: a PC is a complex system, and its continued operation depends on the state of many components. Any single bad component, even an unnecessary and unused one such as a floppy drive or decorative lighting, can bring the PC down or cause problems similar to those described for the GPU. For example, a faulty power button can cause shutdowns or reboots. In more detail, for the key components:



  • CPU: in your use case it will probably not work harder than during ordinary day-to-day use, and you almost certainly do not need to overclock it. Modern CPUs have protective mechanisms such as throttling and emergency shutdown and are considered quite durable. Keep the cooler and thermal paste in order, and the CPU is very unlikely to be the weakest point of the system.

  • Motherboard: much the same as the CPU, although PCIe (and possibly the disks, network and peripherals) will be used heavily; again, prefer proven models.

  • RAM: extremely unlikely to fail, so this risk is not worth worrying about. Just use good modules.

  • Disks: for tasks that rely heavily on disk access (data mining, data processing, training a neural network from data on disk), an HDD can become the reliability weak point: in servers and data centres it is common to replace a disk within 1-3 years, and disks rarely survive 5 years or more. RAID 1 and backups increase reliability for 24/7/365 use (RAID 0 sacrifices reliability for performance, and other RAID levels can take a long time to rebuild; also, RAID is not a backup, so do not neglect backups if you need them). With an SSD, write-heavy workloads can exhaust the terabytes-written (TBW) endurance and render the drive useless, so prioritize TBW over other features (see the small endurance sketch at the end of this answer). RAID 1 with SSDs protects against the sudden failure of one disk but does not help with TBW wear. HDD or SSD depends on your needs, budget and preference; either way, prefer models with a proven reliability record and a proper revision.

  • Power supply: heavily loaded by the graphics card and therefore wears faster, so choose a model with a proven reliability record, a proper revision, and a rated power at least 1.5x the overall system consumption, or at least 2-2.5x the main consumers (the GPU and the CPU). Be sure to use a good 220 V AC cable, because a bad mains cable can cause short circuits, electric shock or burning (anything from smoke and self-destruction to an actual fire)!

  • Case fans: they may seem insignificant, but they are crucial in such use cases, and a fan failure is a big problem for a 24/7/365 system. In general, install as many as you can, but also consider the size: bigger fans are quieter and more effective, while smaller ones can sometimes be installed in greater numbers, so the failure of any single fan hurts the system less. The choice is yours.

  • Exotic cooling: water cooling is compact and effective for hot, overclocked systems, but a leak can seriously damage the PC's components. Liquid-nitrogen setups are extremely effective but bulky and expensive, and almost certainly not required here.

Professional enterprise 24/7/365 systems and components are designed for this kind of duty: they have redundancy in every component, even CPUs and BIOSes, and support hot-swapping of components or modules. Yet even they do not achieve 100% uptime (close, but not equal). Professional Nvidia cards are also faster for CUDA (especially neural networks), but I do not think that is your use case.



Assembling the system is no less important than the components themselves: do not skip any step and do not make careless mistakes, and everything should be fine.



Make sure no software will forcibly shut down or reboot the PC, or kill your process. If you are a Windows 10 user, you may think there is no way to entirely disable updates, but there are workarounds and tools on the Web for that (warning: this may violate the EULA).



Peripherals can also cause problems, just like the PC's internal components. For example, a bad or worn mouse can register a button press when none occurred.



About the key external circumstances:



  • Electricity: hopefully the mains power in your house is reliable and stable, because an outage can cost you the results of your work. A UPS helps with short interruptions; for longer ones it only buys you time to hibernate the system or save your progress cleanly.

  • Network: if your task relies on an Internet or network connection, check that the cabling, modem and router are OK.

Summing up: there is no firm guarantee that everything will be fine, and you must accept that the risks will never be zero. But with well-chosen components, proper assembly, and no bad luck with defective parts, you can use the PC this way with lower risk than the question initially assumes, unless you plan to do this for years on end and expect reliability for 5, 10 or more years.
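
As a small, hedged illustration of the TBW point from the Disks bullet above: the sketch below estimates how long an SSD's endurance budget lasts for a given sustained write rate. The endurance rating and write rate are hypothetical placeholder values, not figures from this answer.

    # Hypothetical values for illustration only.
    tbw_rating_tb = 200.0      # endurance stated on a drive's datasheet, in TB written
    write_rate_mb_per_s = 5.0  # sustained average write rate of the workload

    seconds_per_year = 365 * 24 * 3600
    tb_written_per_year = write_rate_mb_per_s * seconds_per_year / 1e6
    years_to_exhaust = tbw_rating_tb / tb_written_per_year

    print(f"~{tb_written_per_year:.0f} TB written per year")
    print(f"~{years_to_exhaust:.1f} years until the TBW budget is exhausted")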






Answer (score 0)

    Is it safe to keep the GPU on 100% utilization for a very long time?


Yes. It's actually safer than using it for its intended purpose, that is, playing a game once in a while.



Most of the wear (on the electronics) comes from mechanical stress caused by changing temperature. The components heat up at different rates and have different thermal expansion coefficients, so every heat-up/cool-down cycle results in forces that try to tear the card apart, often causing micro-damage that accumulates and can eventually lead to failure. Don't be alarmed: it's supposed to take decades. (Unlike the infamous 2006 nVidia laptop GPUs, which used the wrong solder, so the failures occurred soon enough to be noticeable within the component's lifetime.)



If you start your computations and keep them running at a constant rate, it's actually less stressful for the card: it warms up and then stays there, without the thermal cycles.



The only parts that will see increased wear are the fans, which are usually easy to replace.



As to your plan to run at an actual 100% utilization: 100% is inefficient. Learn the lesson that cryptominers taught us: as you underclock and undervolt the card, the flops go down, but the power consumed goes down even more. You get more performance per watt, and an even better lifespan.
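
One practical, hedged way to act on the underclock/undervolt advice above on an NVIDIA card is to cap the board power with nvidia-smi. The sketch below (Python, assuming the standard nvidia-smi tool, administrator/root rights, and a driver/GPU that actually supports software power limits) applies a cap and prints the resulting limits; the 60 W value is just a placeholder to tune for your own card, not a recommendation from this answer.

    """Sketch: cap the GPU's board power to trade a little throughput for
    lower power draw and temperatures. Requires admin/root; support for
    software power limits varies by GPU and driver."""
    import subprocess

    TARGET_WATTS = 60  # hypothetical cap; pick a value suited to your card

    def set_power_limit(watts, gpu_index=0):
        # -i selects the GPU, -pl sets the software power limit in watts.
        subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
                       check=True)

    def show_power_limits(gpu_index=0):
        out = subprocess.run(
            ["nvidia-smi", "-i", str(gpu_index),
             "--query-gpu=power.limit,power.default_limit,power.draw",
             "--format=csv"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout

    if __name__ == "__main__":
        set_power_limit(TARGET_WATTS)
        print(show_power_limits())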






    share|improve this answer























      Your Answer








      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "3"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f1433515%2fis-it-safe-to-keep-the-gpu-on-100-utilization-for-a-very-long-time%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      6 Answers
      6






      active

      oldest

      votes








      6 Answers
      6






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      56














      Short answer: This should be safe on well-designed hardware.



      Long answer:
      The GPU (and its software environment: drivers, OS, daemons) are designed to protect from overheating - the GPU should first turn the fans to a higher RPM, if that can't keep a safe temperature then the GPU throttles the workload (usually by reducing the clock frequency). This will assure a heat profile that will not damage the GPU and thus not the PC (or the room).



      Caveat: There exist cheap knock-off graphic cards, where the firmware is specifically designed to sacrifice safety for performance. While I don't think those exist for a 1050, I am not 100% sure. You should also prefer the Nvidia drivers downloaded from their website over "optimized" vendor drivers, which might do the same thing.






      share|improve this answer




















      • 23





        It's not just "cheap knock-offs". I've seen six(!) completely independent GeForce 7600GS's from a reputable manufacturer die in the same way due to a presumably inadequate cooling design. This was a fanless "super silent" card used for office work or at most light gaming. However, high-end parts will likely be designed to cope with greater thermal abuse, although likely not for 24/7 loads.

        – TooTea
        May 6 at 9:42







      • 2





        @Klangen The PSU differs from the GPU in that it is usually (apart from servers with a BMC) not actively monitored for temperature. That said, PSUs are designed to "fail safe", i.e. if they fail fail in a way, that they do not create additional damage.

        – Eugen Rieck
        May 6 at 10:02






      • 9





        Because crypto-mining is the canonical example of sacrificing safety against performance.

        – Eugen Rieck
        May 6 at 10:13






      • 6





        Anecdotal evidence: On my older desktop computer, I've managed to fry both available CPU fan power ports (too much dust), so I decided to see if I could run the machine without a CPU fan by keeping a close eye on the CPU temperature. It hit 90C in a couple of minutes, then slowed down significantly. It was a Pentium, I believe.

        – John Dvorak
        May 6 at 10:13






      • 8





        @JohnDvorak Yes, all non-ancient CPUs employ a similar method

        – Eugen Rieck
        May 6 at 10:16















      56














      Short answer: This should be safe on well-designed hardware.



      Long answer:
      The GPU (and its software environment: drivers, OS, daemons) are designed to protect from overheating - the GPU should first turn the fans to a higher RPM, if that can't keep a safe temperature then the GPU throttles the workload (usually by reducing the clock frequency). This will assure a heat profile that will not damage the GPU and thus not the PC (or the room).



      Caveat: There exist cheap knock-off graphic cards, where the firmware is specifically designed to sacrifice safety for performance. While I don't think those exist for a 1050, I am not 100% sure. You should also prefer the Nvidia drivers downloaded from their website over "optimized" vendor drivers, which might do the same thing.






      share|improve this answer




















      • 23





        It's not just "cheap knock-offs". I've seen six(!) completely independent GeForce 7600GS's from a reputable manufacturer die in the same way due to a presumably inadequate cooling design. This was a fanless "super silent" card used for office work or at most light gaming. However, high-end parts will likely be designed to cope with greater thermal abuse, although likely not for 24/7 loads.

        – TooTea
        May 6 at 9:42







      • 2





        @Klangen The PSU differs from the GPU in that it is usually (apart from servers with a BMC) not actively monitored for temperature. That said, PSUs are designed to "fail safe", i.e. if they fail fail in a way, that they do not create additional damage.

        – Eugen Rieck
        May 6 at 10:02






      • 9





        Because crypto-mining is the canonical example of sacrificing safety against performance.

        – Eugen Rieck
        May 6 at 10:13






      • 6





        Anecdotal evidence: On my older desktop computer, I've managed to fry both available CPU fan power ports (too much dust), so I decided to see if I could run the machine without a CPU fan by keeping a close eye on the CPU temperature. It hit 90C in a couple of minutes, then slowed down significantly. It was a Pentium, I believe.

        – John Dvorak
        May 6 at 10:13






      • 8





        @JohnDvorak Yes, all non-ancient CPUs employ a similar method

        – Eugen Rieck
        May 6 at 10:16













      56












      56








      56







      Short answer: This should be safe on well-designed hardware.



      Long answer:
      The GPU (and its software environment: drivers, OS, daemons) are designed to protect from overheating - the GPU should first turn the fans to a higher RPM, if that can't keep a safe temperature then the GPU throttles the workload (usually by reducing the clock frequency). This will assure a heat profile that will not damage the GPU and thus not the PC (or the room).



      Caveat: There exist cheap knock-off graphic cards, where the firmware is specifically designed to sacrifice safety for performance. While I don't think those exist for a 1050, I am not 100% sure. You should also prefer the Nvidia drivers downloaded from their website over "optimized" vendor drivers, which might do the same thing.






      share|improve this answer















      Short answer: This should be safe on well-designed hardware.



      Long answer:
      The GPU (and its software environment: drivers, OS, daemons) are designed to protect from overheating - the GPU should first turn the fans to a higher RPM, if that can't keep a safe temperature then the GPU throttles the workload (usually by reducing the clock frequency). This will assure a heat profile that will not damage the GPU and thus not the PC (or the room).



      Caveat: There exist cheap knock-off graphic cards, where the firmware is specifically designed to sacrifice safety for performance. While I don't think those exist for a 1050, I am not 100% sure. You should also prefer the Nvidia drivers downloaded from their website over "optimized" vendor drivers, which might do the same thing.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited May 8 at 2:43









      Ender

      132




      132










      answered May 6 at 9:14









      Eugen RieckEugen Rieck

      12.2k22731




      12.2k22731







      • 23





        It's not just "cheap knock-offs". I've seen six(!) completely independent GeForce 7600GS's from a reputable manufacturer die in the same way due to a presumably inadequate cooling design. This was a fanless "super silent" card used for office work or at most light gaming. However, high-end parts will likely be designed to cope with greater thermal abuse, although likely not for 24/7 loads.

        – TooTea
        May 6 at 9:42







      • 2





        @Klangen The PSU differs from the GPU in that it is usually (apart from servers with a BMC) not actively monitored for temperature. That said, PSUs are designed to "fail safe", i.e. if they fail fail in a way, that they do not create additional damage.

        – Eugen Rieck
        May 6 at 10:02






      • 9





        Because crypto-mining is the canonical example of sacrificing safety against performance.

        – Eugen Rieck
        May 6 at 10:13






      • 6





        Anecdotal evidence: On my older desktop computer, I've managed to fry both available CPU fan power ports (too much dust), so I decided to see if I could run the machine without a CPU fan by keeping a close eye on the CPU temperature. It hit 90C in a couple of minutes, then slowed down significantly. It was a Pentium, I believe.

        – John Dvorak
        May 6 at 10:13






      • 8





        @JohnDvorak Yes, all non-ancient CPUs employ a similar method

        – Eugen Rieck
        May 6 at 10:16












      • 23





        It's not just "cheap knock-offs". I've seen six(!) completely independent GeForce 7600GS's from a reputable manufacturer die in the same way due to a presumably inadequate cooling design. This was a fanless "super silent" card used for office work or at most light gaming. However, high-end parts will likely be designed to cope with greater thermal abuse, although likely not for 24/7 loads.

        – TooTea
        May 6 at 9:42







      • 2





        @Klangen The PSU differs from the GPU in that it is usually (apart from servers with a BMC) not actively monitored for temperature. That said, PSUs are designed to "fail safe", i.e. if they fail fail in a way, that they do not create additional damage.

        – Eugen Rieck
        May 6 at 10:02






      • 9





        Because crypto-mining is the canonical example of sacrificing safety against performance.

        – Eugen Rieck
        May 6 at 10:13






      • 6





        Anecdotal evidence: On my older desktop computer, I've managed to fry both available CPU fan power ports (too much dust), so I decided to see if I could run the machine without a CPU fan by keeping a close eye on the CPU temperature. It hit 90C in a couple of minutes, then slowed down significantly. It was a Pentium, I believe.

        – John Dvorak
        May 6 at 10:13






      • 8





        @JohnDvorak Yes, all non-ancient CPUs employ a similar method

        – Eugen Rieck
        May 6 at 10:16







      23




      23





      It's not just "cheap knock-offs". I've seen six(!) completely independent GeForce 7600GS's from a reputable manufacturer die in the same way due to a presumably inadequate cooling design. This was a fanless "super silent" card used for office work or at most light gaming. However, high-end parts will likely be designed to cope with greater thermal abuse, although likely not for 24/7 loads.

      – TooTea
      May 6 at 9:42






      It's not just "cheap knock-offs". I've seen six(!) completely independent GeForce 7600GS's from a reputable manufacturer die in the same way due to a presumably inadequate cooling design. This was a fanless "super silent" card used for office work or at most light gaming. However, high-end parts will likely be designed to cope with greater thermal abuse, although likely not for 24/7 loads.

      – TooTea
      May 6 at 9:42





      2




      2





      @Klangen The PSU differs from the GPU in that it is usually (apart from servers with a BMC) not actively monitored for temperature. That said, PSUs are designed to "fail safe", i.e. if they fail fail in a way, that they do not create additional damage.

      – Eugen Rieck
      May 6 at 10:02





      @Klangen The PSU differs from the GPU in that it is usually (apart from servers with a BMC) not actively monitored for temperature. That said, PSUs are designed to "fail safe", i.e. if they fail fail in a way, that they do not create additional damage.

      – Eugen Rieck
      May 6 at 10:02




      9




      9





      Because crypto-mining is the canonical example of sacrificing safety against performance.

      – Eugen Rieck
      May 6 at 10:13





      Because crypto-mining is the canonical example of sacrificing safety against performance.

      – Eugen Rieck
      May 6 at 10:13




      6




      6





      Anecdotal evidence: On my older desktop computer, I've managed to fry both available CPU fan power ports (too much dust), so I decided to see if I could run the machine without a CPU fan by keeping a close eye on the CPU temperature. It hit 90C in a couple of minutes, then slowed down significantly. It was a Pentium, I believe.

      – John Dvorak
      May 6 at 10:13





      Anecdotal evidence: On my older desktop computer, I've managed to fry both available CPU fan power ports (too much dust), so I decided to see if I could run the machine without a CPU fan by keeping a close eye on the CPU temperature. It hit 90C in a couple of minutes, then slowed down significantly. It was a Pentium, I believe.

      – John Dvorak
      May 6 at 10:13




      8




      8





      @JohnDvorak Yes, all non-ancient CPUs employ a similar method

      – Eugen Rieck
      May 6 at 10:16





      @JohnDvorak Yes, all non-ancient CPUs employ a similar method

      – Eugen Rieck
      May 6 at 10:16













      9














      A house fire is extremely unlikely, but the lifespan of the card may be reduced.



      Long-term overheating of the GPU chip probably won't start a fire. The chip may deteriorate and start misbehaving or die completely, but silicon chips aren't too flammable. Bad things usually happen when electrolytic capacitors fail and blow up, but these won't be subject to overheating just because the card is doing a lot of crunching and you also hopefully have a metal PC case to contain the hot shrapnel that results from such failures.



      However, consumer-grade parts aren't in general designed for long-term 24/7 loads. It is thus fairly likely that the card will die sooner than if it wasn't subject to such loads. It is hard to say how much sooner without having some more statistics on a given model. Some people in the HPC community advocate using high-end gaming GPUs instead of special HPC compute parts, and there seems to be some economical sense in that. Although the commodity parts die in a year or so, it's cheaper to keep replacing them because they're many times cheaper than the alternative






      share|improve this answer


















      • 6





        Mechanical stress is the worst when heating up and cooling down, rather than running at a constant temperature. What the OP is planning to do is no worse than playing GPU-intensive games every day for a few months.

        – Dmitry Grigoryev
        May 6 at 11:55






      • 3





        Unfortunately, Nvidia's license prevents a HPC data center from using consumer-grade gaming GPUs in their servers. We're required to use higher-end GPUs and I currently have a order in for P100's when the researchers would actually prefer 1080Ti cards.

        – doneal24
        May 6 at 16:11






      • 4





        @technical_difficulty The EULA for the nVidia driver says that it doesn't apply to datacenter HPC use (that's certainly a concern for large centers, but it doesn't stop people building in-house HPC clusters from consumer parts). There's a decent writeup here: microway.com/knowledge-center-articles/…

        – TooTea
        May 6 at 19:28






      • 4





        @user912264 Yes, that's what I meant, although presumably with a poor choice of words on my end. I'm by no means a lawyer or a licensing expert. My point is that you're not allowed to use the driver in such a situation (because you can't rely on the normal free license).

        – TooTea
        May 6 at 20:06






      • 2





        Also note that you get more wear on the GPU fans from keeping them at a high speed.

        – user71659
        May 6 at 21:54















      9














      A house fire is extremely unlikely, but the lifespan of the card may be reduced.



      Long-term overheating of the GPU chip probably won't start a fire. The chip may deteriorate and start misbehaving or die completely, but silicon chips aren't too flammable. Bad things usually happen when electrolytic capacitors fail and blow up, but these won't be subject to overheating just because the card is doing a lot of crunching and you also hopefully have a metal PC case to contain the hot shrapnel that results from such failures.



      However, consumer-grade parts aren't in general designed for long-term 24/7 loads. It is thus fairly likely that the card will die sooner than if it wasn't subject to such loads. It is hard to say how much sooner without having some more statistics on a given model. Some people in the HPC community advocate using high-end gaming GPUs instead of special HPC compute parts, and there seems to be some economical sense in that. Although the commodity parts die in a year or so, it's cheaper to keep replacing them because they're many times cheaper than the alternative






      share|improve this answer


















      • 6





        Mechanical stress is the worst when heating up and cooling down, rather than running at a constant temperature. What the OP is planning to do is no worse than playing GPU-intensive games every day for a few months.

        – Dmitry Grigoryev
        May 6 at 11:55






      • 3





        Unfortunately, Nvidia's license prevents a HPC data center from using consumer-grade gaming GPUs in their servers. We're required to use higher-end GPUs and I currently have a order in for P100's when the researchers would actually prefer 1080Ti cards.

        – doneal24
        May 6 at 16:11






      • 4





        @technical_difficulty The EULA for the nVidia driver says that it doesn't apply to datacenter HPC use (that's certainly a concern for large centers, but it doesn't stop people building in-house HPC clusters from consumer parts). There's a decent writeup here: microway.com/knowledge-center-articles/…

        – TooTea
        May 6 at 19:28






      • 4





        @user912264 Yes, that's what I meant, although presumably with a poor choice of words on my end. I'm by no means a lawyer or a licensing expert. My point is that you're not allowed to use the driver in such a situation (because you can't rely on the normal free license).

        – TooTea
        May 6 at 20:06






      • 2





        Also note that you get more wear on the GPU fans from keeping them at a high speed.

        – user71659
        May 6 at 21:54













      9












      9








      9







      A house fire is extremely unlikely, but the lifespan of the card may be reduced.



      Long-term overheating of the GPU chip probably won't start a fire. The chip may deteriorate and start misbehaving or die completely, but silicon chips aren't too flammable. Bad things usually happen when electrolytic capacitors fail and blow up, but these won't be subject to overheating just because the card is doing a lot of crunching and you also hopefully have a metal PC case to contain the hot shrapnel that results from such failures.



      However, consumer-grade parts aren't in general designed for long-term 24/7 loads. It is thus fairly likely that the card will die sooner than if it wasn't subject to such loads. It is hard to say how much sooner without having some more statistics on a given model. Some people in the HPC community advocate using high-end gaming GPUs instead of special HPC compute parts, and there seems to be some economical sense in that. Although the commodity parts die in a year or so, it's cheaper to keep replacing them because they're many times cheaper than the alternative






      share|improve this answer













      A house fire is extremely unlikely, but the lifespan of the card may be reduced.



      Long-term overheating of the GPU chip probably won't start a fire. The chip may deteriorate and start misbehaving or die completely, but silicon chips aren't too flammable. Bad things usually happen when electrolytic capacitors fail and blow up, but these won't be subject to overheating just because the card is doing a lot of crunching and you also hopefully have a metal PC case to contain the hot shrapnel that results from such failures.



      However, consumer-grade parts aren't in general designed for long-term 24/7 loads. It is thus fairly likely that the card will die sooner than if it wasn't subject to such loads. It is hard to say how much sooner without having some more statistics on a given model. Some people in the HPC community advocate using high-end gaming GPUs instead of special HPC compute parts, and there seems to be some economical sense in that. Although the commodity parts die in a year or so, it's cheaper to keep replacing them because they're many times cheaper than the alternative







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered May 6 at 9:55









      TooTeaTooTea

      2295




      2295







      • 6





        Mechanical stress is the worst when heating up and cooling down, rather than running at a constant temperature. What the OP is planning to do is no worse than playing GPU-intensive games every day for a few months.

        – Dmitry Grigoryev
        May 6 at 11:55






      • 3





        Unfortunately, Nvidia's license prevents a HPC data center from using consumer-grade gaming GPUs in their servers. We're required to use higher-end GPUs and I currently have a order in for P100's when the researchers would actually prefer 1080Ti cards.

        – doneal24
        May 6 at 16:11






      • 4





        @technical_difficulty The EULA for the nVidia driver says that it doesn't apply to datacenter HPC use (that's certainly a concern for large centers, but it doesn't stop people building in-house HPC clusters from consumer parts). There's a decent writeup here: microway.com/knowledge-center-articles/…

        – TooTea
        May 6 at 19:28






      • 4





        @user912264 Yes, that's what I meant, although presumably with a poor choice of words on my end. I'm by no means a lawyer or a licensing expert. My point is that you're not allowed to use the driver in such a situation (because you can't rely on the normal free license).

        – TooTea
        May 6 at 20:06






      • 2





        Also note that you get more wear on the GPU fans from keeping them at a high speed.

        – user71659
        May 6 at 21:54























      6














      Yes, the card is likely to wear out sooner if it is under constant load. At small geometries, electromigration is a significant source of device failures, and devices are typically designed with a specific target lifetime in mind. That target might be generous for typical operation (e.g. 5 years of continuous use), but it may not assume running at the maximum operating point for all of that time. As soon as you start overclocking, you can expect the target to drop significantly. (Equally, running at only 80% load might roughly double the lifetime as far as this failure mechanism is concerned.)



      There are of course other failure modes related to running components hot or to thermal cycling; the point here is simply that modern electronics (and even 1980s electronics, when badly designed) can be susceptible to 'wearing out'.
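
      To get a rough feel for the scaling, here is a minimal sketch (not part of the original answer) based on Black's equation for electromigration-limited lifetime, MTTF ∝ J^(-n) · exp(Ea / (k·T)). The exponent n = 2, the activation energy Ea = 0.7 eV, and the 80 °C reference junction temperature are illustrative assumptions, not measured values for any particular card.

          import math

          K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

          def relative_mttf(j_ratio, temp_c, ref_temp_c=80.0, n=2.0, ea_ev=0.7):
              """Lifetime relative to a reference operating point, per Black's equation.

              j_ratio    -- current density relative to the reference (1.0 = reference load)
              temp_c     -- junction temperature (Celsius) at the new operating point
              ref_temp_c -- junction temperature at the reference point (assumed 80 C)
              n, ea_ev   -- illustrative Black's-equation parameters (assumptions)
              """
              t = temp_c + 273.15
              t_ref = ref_temp_c + 273.15
              current_term = j_ratio ** (-n)
              thermal_term = math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t - 1.0 / t_ref))
              return current_term * thermal_term

          # Backing off to ~80% load and letting the die drop from 80 C to 70 C:
          print(relative_mttf(j_ratio=0.8, temp_c=70.0))  # ~3x the reference lifetime under these assumptions

      With these assumed numbers, the cooler, lower-current operating point comes out roughly three times better, which is consistent with the answer's point that modest derating buys a disproportionate amount of lifetime.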






      answered May 7 at 8:35
      – Sean Houlihane























      • But would it wear out more doing the same workload over a shorter period of time? In other words, would running at 100% for some period be worse than running it at an average of 50% for double that time?

        – trlkly
        May 8 at 4:34






      • 2





        Yes, exactly. Higher temperatures or higher voltages mean more fatigue for the same amount of 'work', to a first-order approximation, for this effect.

        – Sean Houlihane
        May 8 at 6:19

























      3














      If your cooling system works and your hardware is even vaguely modern, with on-chip temperature monitoring and thermal throttling/suspend/shutdown, then it's entirely safe. It can't overheat so long as the cooler keeps running, and if that fails, the chips will throttle back until they're no longer producing more heat than can be passively dissipated (which may mean suspending completely, which looks like a hang or crash).



      Worst case, if the throttling doesn't kick in fast and hard enough to compensate for accumulated thermal load, some part of the chip may melt or burn out and you'll end up with a dead board. By that point, though, the protection circuitry should have slammed into a complete emergency shutdown, perhaps even tripping a (temporary or permanent) fuse on the power rail, which prevents the entire input voltage from being dumped randomly across the die and starting an actual fire.



      Thankfully, the PC platform worked out most of the kinks in that kind of thermal protection 10-15 years ago, after the minor scandal of some mid-generation PIIIs and Athlons proving entirely capable of smoking themselves (and thus being a fire risk) if the cooler failed or fell off while the CPU was running at full tilt. One generation of chips later, it could easily be demonstrated that an overclocked high-end processor barely exceeded its maximum rated temperature at the heat-spreader surface even if you tore the heatsink and fan off in the middle of a heavy benchmark. The computer slowed to a crawl or suffered a "fatal" crash (fatal only to the software; the hardware just needed the heatsink/fan refitted and a reboot), but the chips survived and no fire risk arose. Hopefully any GPU maker worth their salt isn't a decade and a half behind the curve, especially when their products already run pretty close to their rated temperature limits.



      However, that doesn't make this kind of treatment entirely "safe" for the transistors on the chip. Heavyweight "number crunching" (Bitcoin? Protein folding?) on GPUs is by now a rather infamous way of literally wearing out the silicon. The combination of high voltage and current, continual switching billions of times per second, and sustained high temperatures stresses the components quite a bit, both the chips and support parts like capacitors, so their operating lifetime can be reduced to barely two years in some cases, at least at full speed. They can then run on a bit longer if derated (maximum clock speed limited, etc.) and employed for less demanding purposes, like last year's games, but they are on borrowed time once they start erroring out at maximum speed.



      So it's not going to catch on fire, but I wouldn't bank on the card still being reliable past its third birthday in that employment...
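
      Because the throttling and shutdown mechanisms described above are safety nets rather than something to rely on, it is worth logging the card's behaviour during a months-long run. A minimal sketch, assuming the nvidia-smi tool that ships with the NVIDIA driver is on the PATH and a single GPU (index 0); the query fields are standard --query-gpu properties, but the 30-second interval and the 85 °C alert threshold are arbitrary choices, not recommendations.

          import subprocess
          import time

          QUERY = "temperature.gpu,utilization.gpu,power.draw,clocks.sm"
          ALERT_TEMP_C = 85  # arbitrary; pick something below your card's rated limit

          def sample():
              """Return one CSV line of GPU telemetry from nvidia-smi (GPU 0)."""
              out = subprocess.run(
                  ["nvidia-smi", "-i", "0", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
                  capture_output=True, text=True, check=True,
              )
              return out.stdout.strip()

          while True:
              line = sample()                      # e.g. "71, 100, 68.52, 1721"
              temp_c = int(line.split(",")[0])
              print(time.strftime("%Y-%m-%d %H:%M:%S"), line, flush=True)
              if temp_c >= ALERT_TEMP_C:
                  print("WARNING: GPU is running hot - check cooling, expect throttling", flush=True)
              time.sleep(30)

      Redirect the output to a file and a quick glance at the log will tell you whether clocks are sagging (throttling) or temperatures are creeping up as dust accumulates or the thermal paste ages.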






      answered May 7 at 18:57
      – tahrey























      • Crypto mining especially tends to operate multiple cards packed onto one mobo with inadequate airflow, resulting in high temperatures. Using one card in a good tower case with proper airflow should be significantly less stressful, although there is wear and tear on the fans. And as Sean's and your answers point out, electromigration from being powered up to full voltage can still be a concern even if temperatures are kept in check.

        – Peter Cordes
        May 7 at 20:34


























      1














      As you mention, ventilation is good, so there is no need to worry about that risk factor.



      As for the GPU itself, it will wear faster than it would under ordinary office use of 8-16 hours a day, so at 100% load 24/7/365 it is unlikely to keep working for 5-10 years or more. You must also consider that the card may have a poorly designed cooler (separate from the PC's overall airflow), a bad board design, software and firmware bugs, or poor production quality and manufacturing defects of varying severity and frequency, from isolated units to whole batches. These factors can worsen heating, cause system failures, shorten the lifetime, cause short circuits, or in the worst case even start a fire or give you an electric shock. Some of these issues depend on the model and revision, some are gradually fixed by software/firmware updates, and some vary from one individual unit to another, so prefer models with a proven reliability record and a mature revision (usually the latest available). A bad card can also interfere with other components, for example by generating extra electrical noise. Finally, remember that thermal paste gradually degrades and makes cooling worse.



      The graphics card is not the only component to consider: a PC is a complex system whose successful operation depends on many parts. Even a small, unnecessary, and unused component gone bad, such as a floppy drive or decorative lighting, can bring the PC down or cause problems similar to those described for the GPU. For example, a faulty power button can cause a shutdown or reboot. In more detail, the key components:



      • CPU: in your use case it will likely work no harder than during ordinary day-to-day use, and you almost certainly do not need to overclock it. Modern CPUs have protective mechanisms such as throttling and emergency shutdown and are considered quite durable. Keep the cooler and thermal paste in order and the CPU is very unlikely to be the weakest point of the system.

      • Motherboard: much the same as the CPU, although there will be heavy use of PCIe and perhaps of disks, network, and peripherals; again, prefer proven models.

      • RAM: extremely unlikely to fail, so this risk is not worth worrying about. Just use good modules.

      • Disks: in tasks that rely on disk I/O (data mining, data processing, training a neural network on data stored on disk), an HDD can become the weak point; in servers and data centres it is common to replace a disk within 1-3 years, and disks rarely survive 5 years or more. You can use RAID 1 and backups to increase reliability for 24/7/365 use (RAID 0 trades reliability for performance, and other RAID levels can take a long time to rebuild; also, RAID is not a backup, so don't neglect backups if the data matters). With SSDs, write-heavy workloads can exhaust the terabytes-written (TBW) endurance and render the drive useless, so weight TBW heavily when choosing one; RAID 1 with SSDs protects against the sudden failure of one drive but does not help with TBW. HDD or SSD is a matter of your needs and budget; prefer models with a proven reliability record and a mature revision.

      • Power supply (PSU): heavily loaded by the graphics card and therefore wears faster, so choose a model with a proven reliability record, rated for at least 1.5x the overall system consumption, or at least 2x-2.5x the draw of the main consumers (GPU and CPU); a rough sizing sketch follows this list. Also use a good mains (e.g. 220 V AC) cable, because a bad AC cable can short-circuit, give you an electric shock, or burn (anything from smoke and self-destruction to an actual fire).

      • Case fans: they may seem insignificant, but they are crucial in this kind of use, and their failure is a big problem for a 24/7/365 system. In general, install as many as you reasonably can, and consider the size: bigger fans are quieter and move more air, while smaller ones can sometimes be fitted in greater numbers so that the failure of a single fan hurts less. The choice is yours.

      • Exotic cooling: water cooling is compact and effective for hot, overclocked systems, but a leak can seriously damage components. Liquid-nitrogen setups are extremely effective but bulky and expensive, and almost certainly unnecessary here.
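
      As a rough worked example of the PSU sizing rule above (all of the numbers except the GTX 1050 Ti's nominal 75 W board power are assumptions for illustration):

          # Rough PSU sizing per the rule of thumb above (illustrative numbers, not measurements).
          gpu_tdp_w = 75          # GTX 1050 Ti board power (reference design)
          cpu_tdp_w = 95          # assumed desktop CPU TDP
          rest_of_system_w = 75   # assumed: motherboard, RAM, disks, fans

          main_consumers_w = gpu_tdp_w + cpu_tdp_w              # 170 W
          whole_system_w = main_consumers_w + rest_of_system_w  # 245 W

          print("2x-2.5x of main consumers:", 2.0 * main_consumers_w, "-", 2.5 * main_consumers_w, "W")  # 340-425 W
          print("1.5x of whole system:", 1.5 * whole_system_w, "W")                                      # ~368 W
          # Either rule lands in roughly the 340-425 W range, so a quality 450-500 W unit has comfortable headroom.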

      Professional enterprise 24/7/365 systems are designed for this kind of duty: they have redundancy in every component, even CPUs and BIOSes, and support hot-swapping of parts or whole modules, yet even they do not achieve 100% uptime (close, but not equal). Professional Nvidia cards are also faster for CUDA (especially for neural networks), but I do not think that is your use case.



      Assembling the system matters no less than the components themselves. Don't skip any step, don't rush, and don't cut corners, and everything should be fine.



      Make sure no software will forcibly shut down or reboot the PC, or kill the process. If you are a Windows 10 user, you may think there is no way to disable updates entirely, but there are workarounds and tools on the Web for that (warning: they may violate the EULA).



      Peripherals can cause problems too, just like the PC's internal components. For example, a faulty or worn mouse can register a button press when none happened.



      As for key external circumstances:

      • Electricity: I hope the power in your house is reliable and stable, because an outage can cost you the results of your work. A UPS can ride out short interruptions, but for longer ones it only buys you time to hibernate the system or save your progress cleanly, so checkpoint your work regularly (see the sketch after this list).

      • Network: if your task relies on an Internet or network connection, make sure the cabling, modem, and router are in good shape.
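
      Since a power cut in the middle of a months-long computation can cost you everything since the last save, it pays to checkpoint intermediate results so the job can resume rather than restart. A minimal sketch, not tied to any particular workload; the state dictionary, file names, and 10-minute interval are placeholder assumptions.

          import json
          import os
          import time

          CHECKPOINT = "checkpoint.json"
          CHECKPOINT_EVERY_S = 600  # assumed: save every 10 minutes

          def load_state():
              """Resume from the last checkpoint if one exists, otherwise start fresh."""
              if os.path.exists(CHECKPOINT):
                  with open(CHECKPOINT) as f:
                      return json.load(f)
              return {"iteration": 0, "partial_result": 0.0}

          def save_state(state):
              """Write atomically: dump to a temp file, then rename over the old checkpoint."""
              tmp = CHECKPOINT + ".tmp"
              with open(tmp, "w") as f:
                  json.dump(state, f)
              os.replace(tmp, CHECKPOINT)

          state = load_state()
          last_save = time.monotonic()

          while state["iteration"] < 1_000_000:      # stand-in loop for the real GPU workload
              state["partial_result"] += 1.0         # ... kernel launches / batches would go here ...
              state["iteration"] += 1
              if time.monotonic() - last_save >= CHECKPOINT_EVERY_S:
                  save_state(state)
                  last_save = time.monotonic()

          save_state(state)

      The atomic rename matters: if the power dies mid-write, the previous checkpoint is still intact.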

      Summing up: there is no firm guarantee that everything will go well (literally, only death is guaranteed), and you have to accept that the risks will never be exactly zero. But with well-chosen components, careful assembly, and no bad luck with defective parts, you can run the PC this way at lower risk than the question initially assumed, unless you plan to do it for many years and expect reliability over 5-10 years or more.






      answered May 8 at 8:59
      – bpalij
















































              0















              Is it safe to keep the GPU on 100% utilization for a very long time?




              Yes. It's actually safer than using it for its intended purpose, that is, playing a game once in a while.



              Most of the wear on the electronics comes from mechanical stress caused by changing temperature. The components heat up at different rates and have different thermal expansion coefficients, so every heat-up/cool-down cycle produces forces that try to tear the card apart, leaving micro-damage that accumulates and can eventually lead to failure. Don't be alarmed: it's supposed to take decades. (Unlike the infamous circa-2006 nVidia laptop GPUs that used the wrong solder, so failures showed up soon enough to be noticeable within the component's lifetime.)



              If you start your computation and keep it running at a constant rate, that's actually less stressful for the card: it warms up once and then stays there, without the thermal cycles.



              The only parts that will see increased wear are the fans, which are usually easy to replace.



              As to running at an actual 100% utilization: 100% is inefficient. Learn the lesson the cryptominers taught us: as you underclock and undervolt the card, the FLOPS go down, but power consumption goes down even more, so you get more performance per watt, and a longer lifespan on top of that.
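
              One practical way to act on that, without touching the VBIOS, is to lower the board power limit with nvidia-smi (the -pl option is part of the standard driver tools and needs administrator/root rights; some boards, particularly ones fed only from the PCIe slot, don't allow changing it at all, and the setting does not persist across reboots). The sketch below is a thin wrapper around those commands; the 60 W figure is only an illustration for a 75 W card, not a recommendation.

                  import subprocess

                  def query_power_limits(gpu_index=0):
                      """Show the current, default, and allowed min/max board power limits."""
                      out = subprocess.run(
                          ["nvidia-smi", "-i", str(gpu_index),
                           "--query-gpu=power.limit,power.default_limit,power.min_limit,power.max_limit",
                           "--format=csv"],
                          capture_output=True, text=True, check=True,
                      )
                      return out.stdout

                  def set_power_limit(watts, gpu_index=0):
                      """Cap the board power draw (requires admin/root rights)."""
                      subprocess.run(
                          ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
                          check=True,
                      )

                  print(query_power_limits())
                  set_power_limit(60)  # illustrative: ~80% of a 75 W card; stay within the reported min/max

              The same underclock/undervolt idea can also be applied with vendor overclocking tools; the power-limit route just has the advantage of being scriptable.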






              answered May 8 at 18:26
              – Agent_L



























                0















                Is it safe to keep the GPU on 100% utilization for a very long time?




                Yes. It's actually safer than using it for the intended purpose, that is playing a game once in a while.



                The most wear (of the electronics) comes from mechanical stress from changing temperature. The components heat up at different rates, their thermal expansion coefficients are different, therefore every heat up, cool down cycle results in forces that try to tear the card apart, often resulting in micro-damages that accumulate and can eventually lead to failure. Don't be alarmed, it's supposed to take decades. (Unlike the infamous 2006 nVidia laptop GPUs that used wrong solder so the failures occurred soon enough to be noticeable within component's lifetime)



                If you start your computation and keep them at constant rate, it's actually less stressful to the card, as it warms up and then stays there, without the thermal cycles.



                The only parts that will see increased wear are the fans, which are usually easy to replace.



                As to your plan on actual 100% utilization - 100% is inefficient. Learn from the lesson that cryptominers taught us: as you underclock and undervolt the card, the flops go down, but consumed power goes down even more. You'll get more performance per watt. And even better lifespan.






                share|improve this answer

























                  0












                  0








                  0








                answered May 8 at 18:26 by Agent_L