can high load cause server hang and error “blocked for more than 120 seconds”?

I'm currently running a few VMs and bare-metal servers.
Java is running at high CPU usage, over 400% at times.
The server randomly hangs with errors on the console such as "java - blocked for more than 120 seconds"; the same message also appears for kjournald and other tasks.

I cannot get dmesg output because, for some reason, this error is only written to the console, which I cannot access since the server is remotely hosted. Therefore I cannot copy a full trace.

I changed the environment this runs on, even the physical server, and it is still happening.

I changed hung_task_timeout_secs to 0 in case this is a false positive, as per http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Technical_Notes/deployment.html .

Also, irqbalance is not installed; perhaps it would help?

This is Ubuntu 10.04 64-bit; the same issue occurs with the latest 2.6.38-15-server and with 2.6.36.

Could CPU or memory issues, or no swap being left, cause this issue?

Here is the console message:

[58Z?Z1.5?Z840] INFO: task java:21547 blocked for more than 120 seconds.
[58Z?Z1.5?Z986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?Z06Z] INFO: task kjournald:190 blocked for more than 120 seconds.
[58Z841.5?Z336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?Z600] INFO: task flush-202:0:709 blocked for more than 120 seconds.
[58Z841.5?Z90?] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?3413] INFO: task java:21547 blocked for more than 120 seconds.
[58Z841.5?368Z] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z961.5?ZZ36] INFO: task kjournald:60 blocked for more than 120 seconds.
[58Z961.5?Z6Z5] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z961.5?31ZZ] INFO: task flush-202:0:709 blocked for more than 120 seconds.
[58Z961.5?3393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

edited Jul 5 '12 at 21:49

asked Jul 5 '12 at 21:41

Tee

86114

linux kernel

3 Answers
Yes, it could.

What this means is fairly explicit: the kernel couldn't schedule the task for 120 seconds. This indicates resource starvation, often around disk access.

irqbalance might help, but that isn't obvious. Can you provide the lines surrounding this message in dmesg, in particular the stack trace that follows it?

Moreover, this is not a false positive. This does not say that the task is hung forever, and the statement is perfectly correct. That doesn't mean it's a problem for you, and you can decide to ignore it if you don't notice any user impact.

This cannot be caused by:

a CPU issue (or rather, that would be an insanely improbable hardware failure),

a memory issue (very improbably a hardware failure, but wouldn't happen multiple times; not a lack of RAM as a process would be oom-killed),

a lack of swap (oom-killer again).

To an extent, you might be able to blame this on a lack of memory, in the sense that depriving your system of data caching in RAM will cause more I/O. But it's not as straightforward as "running out of memory".
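As a rough illustration of how to see which tasks the kernel is reporting, the sketch below counts hung-task reports per task name from kernel log text. The function name `hung_task_summary` is made up for this example, and it assumes the log lines have the same "INFO: task NAME:PID blocked for more than N seconds" shape as the console output in the question.

```shell
#!/bin/sh
# Count hung-task reports per task name.
# Reads kernel log text on stdin and prints "<count> <task>" lines.
hung_task_summary() {
    # The greedy .* keeps colons inside names such as "flush-202:0";
    # the final :<digits> before " blocked" is the PID.
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than [0-9][0-9]* seconds.*/\1/p' |
        sort | uniq -c | sed 's/^ *//'
}

# On a live system (may need root to read the ring buffer):
#   dmesg | hung_task_summary
#   dmesg | grep -A 25 'blocked for more than'   # report plus the call trace after it
#   ps -eo state,pid,comm | awk '$1 ~ /^D/'       # tasks in uninterruptible sleep right now
```

If the same filesystem-related tasks (kjournald, flush-*) keep showing up next to the application, that is at least consistent with the disk I/O starvation described above.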

edited Jul 5 '12 at 21:48

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

There is nothing being recorded to /var/log/dmesg so I just pasted what the Console showed.. when this appears the system is 100% hung.

– Tee
Jul 5 '12 at 21:51

This message comes from the kernel, it will appear in dmesg (if it was logged recently enough) as this command prints the kernel logging ring buffer. Hopefully your syslog setup will also log it somewhere in /var/log, but I couldn't know where.

– Pierre Carrier
Jul 5 '12 at 22:19

The message will NOT appear in /var/log/dmesg, but may turn up when you run the dmesg command. The file is created during the boot process and generally only captures boot-time kernel messages (which would otherwise eventually scroll out of the kernel ring buffer). You could also install/enable sysstat and look at resource utilization as reported there. I'm suspecting disk I/O / iowait, likely related to swapping (sysstat will help in identifying this).

– Dr. Edward Morbius
Jul 6 '12 at 19:23

@Dr.EdwardMorbius So how do we fix this? I'm having a major issue related to this with our Zimbra server which was running great in a production environment until recently.

– Lopsided
Mar 21 '14 at 14:45

@Lopsided: Sorry for the delay, I'm not here often. Briefly: you'll have to profile your Java process and find out why it's hanging. Garbage collection is one area I've had issues (and successes) in tuning. Look up JVM garbage collection ergonomics and see oracle.com/technetwork/java/javase/gc-tuning-6-140523.html. I found increasing heap helped markedly.

– Dr. Edward Morbius
Apr 26 '14 at 10:40

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

To keep the change across reboots, also add both settings to /etc/sysctl.conf, then reload that file with:

sudo sysctl -p

solved it for me....
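For context, a hedged sketch of what these two settings control and one way to persist them. `vm.dirty_background_ratio` is the percentage of available memory holding dirty pages at which the kernel starts writing them out in the background; `vm.dirty_ratio` is the percentage at which a process doing writes is made to wait for writeback itself. Lower values mean smaller, more frequent flushes. Note that `sysctl -w` only changes the running kernel and `sysctl -p` reloads values from a file, so the values need to be in a file to survive a reboot. The function name and the file name below are assumptions for this example, not part of the original answer.

```shell
#!/bin/sh
# Write the two writeback settings to a sysctl config file.
# $1: target file (the default name is an assumption for this sketch)
# $2: vm.dirty_background_ratio (default 5, as in the answer)
# $3: vm.dirty_ratio (default 10, as in the answer)
write_dirty_conf() {
    conf=${1:-/etc/sysctl.d/60-dirty-writeback.conf}
    printf 'vm.dirty_background_ratio = %s\nvm.dirty_ratio = %s\n' \
        "${2:-5}" "${3:-10}" > "$conf"
}

# On a live system:
#   sysctl vm.dirty_ratio vm.dirty_background_ratio    # show current values
#   grep -E '^(Dirty|Writeback):' /proc/meminfo        # dirty data waiting for disk
# As root, to persist and apply:
#   write_dirty_conf && sysctl -p /etc/sysctl.d/60-dirty-writeback.conf
```

Whether 5 and 10 are good values depends on the amount of RAM and how fast the storage is; treat them as a starting point to measure against.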

edited Apr 21 at 19:32

Glutanimate

1034

answered Feb 21 '16 at 11:48

Nick

6111

6

You should explain what each those settings do.

– kasperd
Feb 21 '16 at 16:36

5

This fixed a similar issue I was having in a docker environment. I found an explanation here: blackmoreops.com/2014/09/22/…. "By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing..."

– Peter M
Feb 29 '16 at 16:35

I recently went through this error in one of our Production clusters:

Nov 11 14:56:41 xxx kernel: INFO: task xfsalloc/3:2393 blocked for more than 120 seconds.

Nov 11 14:56:41 Xxxx kernel: Not tainted 2.6.32-504.8.1.el6.x86_64 #1

Nov 11 14:56:41 xxx: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

On checking the sar logs, I found that I/O wait had increased during the same period.

Upon checking the hardware (physical disks), I saw that medium errors and other SCSI errors had been logged on one of the physical disks, which in turn was blocking I/O due to a lack of resources to allocate.

11/11/15 19:52:40: terminatated pRdm 607b8000 flags=0 TimeOutC=0 RetryC=0 Request c1173100 Reply 60e06040 iocStatus 0048 retryC 0 devId:3 devFlags=f1482005 iocLogInfo:31140000

11/11/15 19:52:40: DM_ProcessDevWaitQueue: Task mgmt in process devId=x
11/11/15 19:52:40: DM_ProcessDevWaitQueue: Task mgmt in process devId=x

So in our cluster this was due to a hardware error.

It would be worth checking for a core file and, if an IPMI utility is installed, running ipmiutil or ipmitool sel elist to look for hardware events.
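A minimal sketch of that kind of hardware check, assuming the kernel log is readable and that smartmontools, sysstat and ipmitool are installed. The function name `disk_error_lines` and the device name `/dev/sda` are placeholders, and the grep patterns are only a heuristic for the sort of messages mentioned above.

```shell
#!/bin/sh
# Filter kernel log text on stdin for lines that hint at failing storage.
# The patterns are a heuristic, not an exhaustive list.
disk_error_lines() {
    grep -i -E 'medium error|i/o error|sense key'
}

# On a live system (as root; /dev/sda is a placeholder device name):
#   dmesg | disk_error_lines
#   smartctl -H /dev/sda        # overall SMART health, needs smartmontools
#   ipmitool sel elist          # hardware event log, needs a BMC and ipmitool
#   sar -d                      # per-device I/O statistics, needs sysstat history
```

An empty result does not prove the disks are healthy; it only means none of these particular messages were logged recently.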

Regards,
VT

answered Nov 12 '15 at 15:27

Varun Thomas

211

add a comment |

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "2"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fserverfault.com%2fquestions%2f405210%2fcan-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Yes, it could.

What this means is fairly explicit: the kernel couldn't schedule the task for 120 seconds. This indicates resource starvation, often around disk access.

irqbalance might help, but that doesn't sound obvious. Can you provide us with the surrounding of this message in dmesg, in particular the stack trace that follows it?

This cannot be caused by:

a CPU issue (or rather, that would be an insanely improbable hardware failure),

a memory issue (very improbably a hardware failure, but wouldn't happen multiple times; not a lack of RAM as a process would be oom-killed),

a lack of swap (oom-killer again).

edited Jul 5 '12 at 21:48

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

There is nothing being recorded to /var/log/dmesg so I just pasted what the Console showed.. when this appears the system is 100% hung.

– Tee
Jul 5 '12 at 21:51

This message comes from the kernel, it will appear in dmesg (if it was logged recently enough) as this command prints the kernel logging ring buffer. Hopefully your syslog setup will also log it somewhere in /var/log, but I couldn't know where.

– Pierre Carrier
Jul 5 '12 at 22:19

The message will NOT appear in /var/log/dmesg, but may turn up when you run the dmesg command. The file is created during the boot process and generally only captures boot-time kernel messages (which would otherwise eventually scroll out of the kernel ring buffer. You could also install/enable sysstat and look at resource utilization as reported there. I'm suspecting disk I/O / iowait, likely related to swapping (sysstat will help in identifying this).

– Dr. Edward Morbius
Jul 6 '12 at 19:23

@Dr.EdwardMorbius So how do we fix this? I'm having a major issue related to this with our Zimbra server which was running great in a production environment until recently.

– Lopsided
Mar 21 '14 at 14:45

@Lopsided: Sorry for the delay, I'm not here often. Briefly: you'll have to profile your Java process and find out why it's hanging. Garbage collection is one area I've had issues (and successes) in tuning. Look up JVM garbage collection ergodymics and see oracle.com/technetwork/java/javase/gc-tuning-6-140523.html I found increasing heap helped markedly.

– Dr. Edward Morbius
Apr 26 '14 at 10:40

add a comment |

Yes, it could.

What this means is fairly explicit: the kernel couldn't schedule the task for 120 seconds. This indicates resource starvation, often around disk access.

irqbalance might help, but that doesn't sound obvious. Can you provide us with the surrounding of this message in dmesg, in particular the stack trace that follows it?

This cannot be caused by:

a CPU issue (or rather, that would be an insanely improbable hardware failure),

a memory issue (very improbably a hardware failure, but wouldn't happen multiple times; not a lack of RAM as a process would be oom-killed),

a lack of swap (oom-killer again).

edited Jul 5 '12 at 21:48

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

There is nothing being recorded to /var/log/dmesg so I just pasted what the Console showed.. when this appears the system is 100% hung.

– Tee
Jul 5 '12 at 21:51

This message comes from the kernel, it will appear in dmesg (if it was logged recently enough) as this command prints the kernel logging ring buffer. Hopefully your syslog setup will also log it somewhere in /var/log, but I couldn't know where.

– Pierre Carrier
Jul 5 '12 at 22:19

The message will NOT appear in /var/log/dmesg, but may turn up when you run the dmesg command. The file is created during the boot process and generally only captures boot-time kernel messages (which would otherwise eventually scroll out of the kernel ring buffer. You could also install/enable sysstat and look at resource utilization as reported there. I'm suspecting disk I/O / iowait, likely related to swapping (sysstat will help in identifying this).

– Dr. Edward Morbius
Jul 6 '12 at 19:23

@Dr.EdwardMorbius So how do we fix this? I'm having a major issue related to this with our Zimbra server which was running great in a production environment until recently.

– Lopsided
Mar 21 '14 at 14:45

@Lopsided: Sorry for the delay, I'm not here often. Briefly: you'll have to profile your Java process and find out why it's hanging. Garbage collection is one area I've had issues (and successes) in tuning. Look up JVM garbage collection ergodymics and see oracle.com/technetwork/java/javase/gc-tuning-6-140523.html I found increasing heap helped markedly.

– Dr. Edward Morbius
Apr 26 '14 at 10:40

add a comment |

Yes, it could.

What this means is fairly explicit: the kernel couldn't schedule the task for 120 seconds. This indicates resource starvation, often around disk access.

irqbalance might help, but that doesn't sound obvious. Can you provide us with the surrounding of this message in dmesg, in particular the stack trace that follows it?

This cannot be caused by:

a CPU issue (or rather, that would be an insanely improbable hardware failure),

a memory issue (very improbably a hardware failure, but wouldn't happen multiple times; not a lack of RAM as a process would be oom-killed),

a lack of swap (oom-killer again).

edited Jul 5 '12 at 21:48

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

Yes, it could.

What this means is fairly explicit: the kernel couldn't schedule the task for 120 seconds. This indicates resource starvation, often around disk access.

irqbalance might help, but that doesn't sound obvious. Can you provide us with the surrounding of this message in dmesg, in particular the stack trace that follows it?

This cannot be caused by:

a CPU issue (or rather, that would be an insanely improbable hardware failure),

a memory issue (very improbably a hardware failure, but wouldn't happen multiple times; not a lack of RAM as a process would be oom-killed),

a lack of swap (oom-killer again).

edited Jul 5 '12 at 21:48

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

edited Jul 5 '12 at 21:48

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

answered Jul 5 '12 at 21:43

Pierre Carrier

2,4521126

There is nothing being recorded to /var/log/dmesg so I just pasted what the Console showed.. when this appears the system is 100% hung.

– Tee
Jul 5 '12 at 21:51

This message comes from the kernel, it will appear in dmesg (if it was logged recently enough) as this command prints the kernel logging ring buffer. Hopefully your syslog setup will also log it somewhere in /var/log, but I couldn't know where.

– Pierre Carrier
Jul 5 '12 at 22:19

The message will NOT appear in /var/log/dmesg, but may turn up when you run the dmesg command. The file is created during the boot process and generally only captures boot-time kernel messages (which would otherwise eventually scroll out of the kernel ring buffer. You could also install/enable sysstat and look at resource utilization as reported there. I'm suspecting disk I/O / iowait, likely related to swapping (sysstat will help in identifying this).

– Dr. Edward Morbius
Jul 6 '12 at 19:23

@Dr.EdwardMorbius So how do we fix this? I'm having a major issue related to this with our Zimbra server which was running great in a production environment until recently.

– Lopsided
Mar 21 '14 at 14:45

@Lopsided: Sorry for the delay, I'm not here often. Briefly: you'll have to profile your Java process and find out why it's hanging. Garbage collection is one area I've had issues (and successes) in tuning. Look up JVM garbage collection ergodymics and see oracle.com/technetwork/java/javase/gc-tuning-6-140523.html I found increasing heap helped markedly.

– Dr. Edward Morbius
Apr 26 '14 at 10:40

add a comment |

There is nothing being recorded to /var/log/dmesg so I just pasted what the Console showed.. when this appears the system is 100% hung.

– Tee
Jul 5 '12 at 21:51

This message comes from the kernel, it will appear in dmesg (if it was logged recently enough) as this command prints the kernel logging ring buffer. Hopefully your syslog setup will also log it somewhere in /var/log, but I couldn't know where.

– Pierre Carrier
Jul 5 '12 at 22:19

The message will NOT appear in /var/log/dmesg, but may turn up when you run the dmesg command. The file is created during the boot process and generally only captures boot-time kernel messages (which would otherwise eventually scroll out of the kernel ring buffer. You could also install/enable sysstat and look at resource utilization as reported there. I'm suspecting disk I/O / iowait, likely related to swapping (sysstat will help in identifying this).

– Dr. Edward Morbius
Jul 6 '12 at 19:23

@Dr.EdwardMorbius So how do we fix this? I'm having a major issue related to this with our Zimbra server which was running great in a production environment until recently.

– Lopsided
Mar 21 '14 at 14:45

@Lopsided: Sorry for the delay, I'm not here often. Briefly: you'll have to profile your Java process and find out why it's hanging. Garbage collection is one area I've had issues (and successes) in tuning. Look up JVM garbage collection ergodymics and see oracle.com/technetwork/java/javase/gc-tuning-6-140523.html I found increasing heap helped markedly.

– Dr. Edward Morbius
Apr 26 '14 at 10:40

There is nothing being recorded to /var/log/dmesg so I just pasted what the Console showed.. when this appears the system is 100% hung.

– Tee
Jul 5 '12 at 21:51

This message comes from the kernel, it will appear in dmesg (if it was logged recently enough) as this command prints the kernel logging ring buffer. Hopefully your syslog setup will also log it somewhere in /var/log, but I couldn't know where.

– Pierre Carrier
Jul 5 '12 at 22:19

The message will NOT appear in /var/log/dmesg, but may turn up when you run the dmesg command. The file is created during the boot process and generally only captures boot-time kernel messages (which would otherwise eventually scroll out of the kernel ring buffer. You could also install/enable sysstat and look at resource utilization as reported there. I'm suspecting disk I/O / iowait, likely related to swapping (sysstat will help in identifying this).

– Dr. Edward Morbius
Jul 6 '12 at 19:23

@Dr.EdwardMorbius So how do we fix this? I'm having a major issue related to this with our Zimbra server which was running great in a production environment until recently.

– Lopsided
Mar 21 '14 at 14:45

@Lopsided: Sorry for the delay, I'm not here often. Briefly: you'll have to profile your Java process and find out why it's hanging. Garbage collection is one area I've had issues (and successes) in tuning. Look up JVM garbage collection ergodymics and see oracle.com/technetwork/java/javase/gc-tuning-6-140523.html I found increasing heap helped markedly.

– Dr. Edward Morbius
Apr 26 '14 at 10:40

add a comment |

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

Then commit the change with:

sudo sysctl -p

solved it for me....

edited Apr 21 at 19:32

Glutanimate

1034

answered Feb 21 '16 at 11:48

Nick

6111

6

You should explain what each those settings do.

– kasperd
Feb 21 '16 at 16:36

5

This fixed a similar issue I was having in a docker environment. I found an explanation here: blackmoreops.com/2014/09/22/…. "By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing..."

– Peter M
Feb 29 '16 at 16:35

add a comment |

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

Then commit the change with:

sudo sysctl -p

solved it for me....

edited Apr 21 at 19:32

Glutanimate

1034

answered Feb 21 '16 at 11:48

Nick

6111

6

You should explain what each those settings do.

– kasperd
Feb 21 '16 at 16:36

5

This fixed a similar issue I was having in a docker environment. I found an explanation here: blackmoreops.com/2014/09/22/…. "By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing..."

– Peter M
Feb 29 '16 at 16:35

add a comment |

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

Then commit the change with:

sudo sysctl -p

solved it for me....

edited Apr 21 at 19:32

Glutanimate

1034

answered Feb 21 '16 at 11:48

Nick

6111

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

Then commit the change with:

sudo sysctl -p

solved it for me....

edited Apr 21 at 19:32

Glutanimate

1034

answered Feb 21 '16 at 11:48

Nick

6111

edited Apr 21 at 19:32

Glutanimate

1034

edited Apr 21 at 19:32

Glutanimate

1034

edited Apr 21 at 19:32

Glutanimate

1034

answered Feb 21 '16 at 11:48

Nick

6111

answered Feb 21 '16 at 11:48

Nick

6111

answered Feb 21 '16 at 11:48

Nick

6111

6

You should explain what each those settings do.

– kasperd
Feb 21 '16 at 16:36

5

This fixed a similar issue I was having in a docker environment. I found an explanation here: blackmoreops.com/2014/09/22/…. "By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing..."

– Peter M
Feb 29 '16 at 16:35


I recently ran into this error in one of our production clusters:

Nov 11 14:56:41 xxx kernel: INFO: task xfsalloc/3:2393 blocked for more than 120 seconds.

Nov 11 14:56:41 Xxxx kernel: Not tainted 2.6.32-504.8.1.el6.x86_64 #1

Nov 11 14:56:41 xxx: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
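
If you have a lot of these messages to go through, a small script can pull out which tasks were blocked. This is a minimal sketch that assumes the message format shown above. The names `HUNG_TASK` and `parse_hung_tasks` are made up for illustration, and the sample line is the one from my log joined onto a single line.

```python
import re

# Matches e.g. "INFO: task xfsalloc/3:2393 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_tasks(lines):
    """Return (task name, pid, timeout in seconds) for each hung-task line."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match["name"], int(match["pid"]), int(match["secs"]))
            )
    return hits


sample = [
    "Nov 11 14:56:41 xxx kernel: INFO: task xfsalloc/3:2393 "
    "blocked for more than 120 seconds.",
    "Nov 11 14:56:41 Xxxx kernel: Not tainted 2.6.32-504.8.1.el6.x86_64 #1",
]
for name, pid, secs in parse_hung_tasks(sample):
    print(f"task={name} pid={pid} blocked>{secs}s")
```

In practice you would feed it the lines of your kernel log (for example the output of dmesg) instead of the sample list.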

On checking the sar logs, I found that I/O wait had increased during the same period.
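
If sar is not collecting data on the box, the same figure can be approximated from two samples of the aggregate "cpu" line in /proc/stat. This is a rough sketch, not what sar does internally. The helper names are made up, the two sample lines are invented numbers, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
def cpu_fields(stat_line):
    """Split the aggregate 'cpu' line of /proc/stat into integer counters."""
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two 'cpu' samples."""
    deltas = [b - a for a, b in zip(cpu_fields(before), cpu_fields(after))]
    # Sum user..steal only; guest time is already counted inside user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()


# Invented samples, as if taken a few seconds apart.
before = "cpu 100 0 50 800 50 0 0 0 0 0"
after = "cpu 200 0 100 1000 600 0 0 0 0 0"
print(f"iowait: {iowait_percent(before, after):.1f}%")
```

On a live system you would call read_cpu_line() twice with a short sleep in between and pass both results to iowait_percent().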

On checking the hardware (the physical disks), I saw that medium errors and other SCSI errors had been logged on one of the physical disks, which in turn was blocking the I/Os due to a lack of resources to allocate.

11/11/15 19:52:40: terminatated pRdm 607b8000 flags=0 TimeOutC=0 RetryC=0 Request c1173100 Reply 60e06040 iocStatus 0048 retryC 0 devId:3 devFlags=f1482005 iocLogInfo:31140000

11/11/15 19:52:40: DM_ProcessDevWaitQueue: Task mgmt in process devId=x

11/11/15 19:52:40: DM_ProcessDevWaitQueue: Task mgmt in process devId=x

So in our cluster this was caused by a hardware error.

It would therefore be worth checking for a core file and, if an IPMI utility is installed, running the ipmiutil/ipmitool sel elist command to look for hardware events.

Regards,
VT

answered Nov 12 '15 at 15:27

Varun Thomas

211

