[#FS-4670] Session limit value arbitrarily changes

Created: 01/Oct/12  Updated: 24/Feb/14  Resolved: 16/Oct/12
Status: Closed
Project: FreeSWITCH
Component/s: None
Affects Version/s: None
Fix Version/s: None
Security Level: public
Type: Bug
Priority: Minor
Reporter: Umberto Mautone
Assignee: Anthony Minessale II
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 5.4 64-bit
Attachments: err.patch, fs-cps.jpg, fs-ports.jpg, task.patch
CPU Architecture: x86
Kernel: Linux
uname: Linux 424989-web4 2.6.18-308.8.2.el5 #1 SMP Tue Jun 12 09:58:12 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Userland: GNU/Linux
Compiler: gcc
FreeSWITCH GIT Revision: 0cb64960f1378be6883f0c35e2aae5b7a227c6cb
GIT Master Revision hash: Last update Sep 28th
Description
As of the last few updates, I've been seeing strange behavior with regard to session limits. I
tried raising the session limit in switch.conf.xml to a very high number:
<param name="max-sessions" value="100000"/>
As of about 20 minutes ago, the switch has not been able to exceed 500 simultaneous calls
(1000 sessions) and this is what I get in the log:
2012-10-01 11:10:14.835218 [CRIT] switch_core_session.c:2049 Over Session Limit! 1091
2012-10-01 11:10:14.835218 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:10:14.835218 [CRIT] switch_core_session.c:2049 Over Session Limit! 1091
2012-10-01 11:10:14.835218 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:10:14.835218 [CRIT] switch_core_session.c:2049 Over Session Limit! 1091
2012-10-01 11:10:14.835218 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:10:14.855219 [CRIT] switch_core_session.c:2049 Over Session Limit! 1091
2012-10-01 11:10:14.855219 [CRIT] mod_sofia.c:4543 Error Creating Session
The only way to fix the problem is to restart FreeSWITCH.
Comments
Comment by Anthony Minessale II [ 01/Oct/12 ]
Do you see any other errors stating that the session limit has been lowered? grep for CRIT,
WARN or ERR in the log.
If the system fails to create a thread at any point it will force the session limit down to the level
of calls currently available to prevent failures.
If you see it saying 1091 and that is not the number you set it to, it probably lowered itself.
You can increase it while it's running with "fsctl sps <new val>"
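One way to follow this advice is to scan the log for the limit-related CRIT lines and list every value the limit was reported at. The sketch below is a hypothetical helper, not part of FreeSWITCH: it assumes log lines of the shape quoted in this ticket, and the name `limit_events` is made up for illustration.

```python
import re

# Patterns taken from the log excerpts quoted in this ticket.
OVER_LIMIT = re.compile(r"Over Session Limit! (\d+)")
LOWERED = re.compile(r"reduces the max sessions to (\d+)")


def limit_events(lines):
    """Return (kind, value) pairs for every limit-related line, in order."""
    events = []
    for line in lines:
        m = LOWERED.search(line)
        if m:
            events.append(("lowered", int(m.group(1))))
            continue
        m = OVER_LIMIT.search(line)
        if m:
            events.append(("over_limit", int(m.group(1))))
    return events


sample = [
    "2012-10-01 11:10:14.835218 [CRIT] switch_core_session.c:2049 Over Session Limit! 1091",
    "2012-10-01 11:10:14.835218 [CRIT] mod_sofia.c:4543 Error Creating Session",
    "Artoo reduces the max sessions to 5721 thus, saving the switch from certain doom.",
]
print(limit_events(sample))
```

Passing an open log file instead of `sample` should work the same way, since the function only iterates over lines. A "lowered" event with a value far below the configured max-sessions is the sign that the limit lowered itself.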
Comment by Umberto Mautone [ 01/Oct/12 ]
Just noticed the box right next to it started doing the same thing a few minutes ago. Same
settings (limit = 100 000) and the logs are spitting this out:
2012-10-01 11:19:10.756967 [CRIT] switch_core_session.c:2049 Over Session Limit! 3372
2012-10-01 11:19:10.756967 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:19:10.756967 [CRIT] switch_core_session.c:2049 Over Session Limit! 3372
2012-10-01 11:19:10.756967 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:19:10.756967 [CRIT] switch_core_session.c:2049 Over Session Limit! 3372
2012-10-01 11:19:10.756967 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:19:10.777952 [CRIT] switch_core_session.c:2049 Over Session Limit! 3372
2012-10-01 11:19:10.777952 [CRIT] mod_sofia.c:4543 Error Creating Session
In an older log file, I see this:
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 5721 thus, saving the switch from certain doom.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 5721 thus, saving the switch from certain doom.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 5721 thus, saving the switch from certain doom.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 5721 thus, saving the switch from certain doom.
2012-10-01 11:15:40.758027 [CRIT] switch_core_session.c:1559 Thread Failure!
Here's what TOP says:
top - 11:23:21 up 87 days, 1:14, 1 user, load average: 2369.08, 1496.55, 1436.02
Tasks: 245 total, 1 running, 243 sleeping, 0 stopped, 1 zombie
Cpu(s): 31.0%us, 16.6%sy, 0.0%ni, 51.8%id, 0.1%wa, 0.1%hi, 0.4%si, 0.0%st
Mem: 49447640k total, 30837392k used, 18610248k free, 479752k buffers
Swap: 51511288k total, 192k used, 51511096k free, 25373164k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16181 root -2 -10 7270m 2.3g 6372 S 501.0 4.9 22:41.48 freeswitch
Plenty of CPU left and plenty of RAM left.
Comment by Umberto Mautone [ 01/Oct/12 ]
Just restarted the process and within a couple minutes it reduced the session limit to 353?!?
2012-10-01 11:25:13.822784 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:25:13.822784 [CRIT] switch_core_session.c:2049 Over Session Limit! 353
2012-10-01 11:25:13.822784 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:25:13.902793 [CRIT] switch_core_session.c:2049 Over Session Limit! 353
2012-10-01 11:25:13.902793 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:25:13.902793 [CRIT] switch_core_session.c:2049 Over Session Limit! 353
2012-10-01 11:25:13.902793 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:25:13.902793 [CRIT] switch_core_session.c:2049 Over Session Limit! 353
2012-10-01 11:25:13.902793 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:25:13.902793 [CRIT] switch_core_session.c:2049 Over Session Limit! 353
2012-10-01 11:25:13.902793 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:25:13.902793 [CRIT] switch_core_session.c:2049 Over Session Limit! 353
2012-10-01 11:25:13.902793 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 11:25:13.922760 [CRIT] switch_core_session.c:2049 Over Session Limit! 353
Comment by Umberto Mautone [ 01/Oct/12 ]
I just watched a box suddenly drop from nearly 6000 calls down to 150 calls and the logs say
session limit was lowered to 335.
What could suddenly be causing thread creation failures so frequently?
Comment by Umberto Mautone [ 01/Oct/12 ]
Is there any way I can disable this feature? These are in production and I'm forced to sit in front
of them to manually reset the sessions every few minutes.
Comment by Peter Olsson [ 01/Oct/12 ]
There must be something going on in your server. What happens here is that the call to
switch_thread_create() actually fails, so there is really nothing else FS could do. It could try to
just go on, but that would just totally crash the server eventually.
You need to look for other clues, FS just reports that the OS can't create any more threads.
/Peter
Comment by Anthony Minessale II [ 01/Oct/12 ]
Apply this patch to get an idea what the error is please.
Comment by Umberto Mautone [ 01/Oct/12 ]
Ok, I'll apply the patch and report back.
Is there any way to rate-limit inbound calls in SOFIA rather than using session limiters? The
reason I ask is that, using session limits, a flood of inbound calls can prevent the creation of
outbound calls, which create a session. So, when I'm route advancing a call already seized,
outbound calls fail due to the inbound flood hitting the session rate limit.
Comment by Anthony Minessale II [ 01/Oct/12 ]
Add {no_throttle_limits=true} to your outbound calls to have them not count toward sps.
Comment by Umberto Mautone [ 01/Oct/12 ]
Here's what the extra reporting gives us:
2012-10-01 13:17:39.186726 [CRIT] switch_apr.c:658 ERROR: THREAD CREATE
[11][Resource temporarily unavailable]
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 6213 thus, saving the switch from certain doom.
2012-10-01 13:17:39.186726 [CRIT] switch_apr.c:658 ERROR: THREAD CREATE
[11][Resource temporarily unavailable]
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-01 13:17:39.186726 [CRIT] switch_apr.c:658 ERROR: THREAD CREATE
[11][Resource temporarily unavailable]
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 6213 thus, saving the switch from certain doom.
2012-10-01 13:17:39.186726 [CRIT] switch_apr.c:658 ERROR: THREAD CREATE
[11][Resource temporarily unavailable]
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1559 Thread Failure!
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-01 13:17:39.186726 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 6213 thus, saving the switch from certain doom.
2012-10-01 13:17:39.186726 [CRIT] switch_apr.c:658 ERROR: THREAD CREATE
[11][Resource temporarily unavailable]
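The bracketed pair in the new log line is an errno value and its message. On Linux, errno 11 is EAGAIN, which pthread_create documents as "insufficient resources to create another thread" or a system-imposed limit on the number of threads being hit. The sketch below pulls the pair out of such a line. The helper name `parse_thread_error` is hypothetical, and the claim that the error originates from pthread_create underneath switch_thread_create() is an assumption rather than something stated in this ticket.

```python
import errno
import os
import re

# Matches the "[code][message]" pair printed by the extra reporting above.
ERR_PAIR = re.compile(r"\[(\d+)\]\[([^\]]*)\]")


def parse_thread_error(line):
    """Extract (errno_code, message) from a THREAD CREATE error line."""
    m = ERR_PAIR.search(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


# The log entry is wrapped over two lines in the ticket; it is joined here.
line = ("2012-10-01 13:17:39.186726 [CRIT] switch_apr.c:658 "
        "ERROR: THREAD CREATE [11][Resource temporarily unavailable]")
code, message = parse_thread_error(line)

print(code, message)
# The numeric value of EAGAIN is platform specific, so compare symbolically.
print("is EAGAIN on this platform:", code == errno.EAGAIN)
print("local strerror:", os.strerror(code))
```

EAGAIN points at a resource ceiling (thread, PID or memory-mapping limits) rather than at exhausted CPU or RAM, which would be consistent with the TOP output showing plenty of both.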
Comment by Umberto Mautone [ 01/Oct/12 ]
There were nearly 5000 ports going at the time thread creation failed. It set the new session limit
to 2544.
2012-10-01 13:20:06.525769 [CRIT] switch_core_session.c:2049 Over Session Limit! 2544
2012-10-01 13:20:06.525769 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 13:20:06.525769 [CRIT] switch_core_session.c:2049 Over Session Limit! 2544
2012-10-01 13:20:06.525769 [CRIT] mod_sofia.c:4543 Error Creating Session
Comment by Anthony Minessale II [ 01/Oct/12 ]
Can you update to latest and try it?
Comment by Umberto Mautone [ 01/Oct/12 ]
I updated just before patching. Git log says bf060c6396c8b45b4dff154b00395744fceee2ea
Is there a later update you just pushed?
EDIT:
Full log detail of latest commit:
commit bf060c6396c8b45b4dff154b00395744fceee2ea
Author: Jeff Lenk <jeff@jefflenk.com>
Date: Mon Oct 1 08:56:51 2012 -0500
FS-4664 --resolve
Comment by Anthony Minessale II [ 01/Oct/12 ]
After this many years I would hope you would figure out when I say "can you update and try
latest" I mean literally as of the time of the post and not from an hour ago ;)
Good time to remind you we offer real support contracts for this kind of situation....
Comment by Umberto Mautone [ 01/Oct/12 ]
Ok, this is completely bizarre. There are about 900 calls on and the log threw this out for a few
minutes, then stopped:
2012-10-01 13:48:21.594414 [CRIT] switch_core_session.c:2049 Over Session Limit! 31772
2012-10-01 13:48:21.594414 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 13:48:21.594414 [CRIT] switch_core_session.c:2049 Over Session Limit! 31772
2012-10-01 13:48:21.594414 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 13:48:21.594414 [CRIT] switch_core_session.c:2049 Over Session Limit! 31772
2012-10-01 13:48:21.594414 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 13:48:21.594414 [CRIT] switch_core_session.c:2049 Over Session Limit! 31772
2012-10-01 13:48:21.594414 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-01 13:48:21.594414 [CRIT] switch_core_session.c:2049 Over Session Limit! 31772
Comment by Umberto Mautone [ 01/Oct/12 ]
You're absolutely right. LOL!
Ok, updating now.
Comment by Umberto Mautone [ 01/Oct/12 ]
So far, so good but peak traffic has dropped. I'm expecting another wave of peak traffic later
today.
Comment by Umberto Mautone [ 01/Oct/12 ]
Currently in second peak. It looks like the thread creation problem is gone. Your fix may also
have addressed what seems to have been a memory leak. The process hasn't grown significantly
in RAM as it normally does under heavy load. At this volume, it's usually around 4.5g by this
time but, as of right now, it has not yet reached 2g. I've also noticed that it is better at releasing
memory when the number of channels drops.
Comment by Umberto Mautone [ 02/Oct/12 ]
The problem is still there. The odd thing is that there was never anywhere near 32k sessions.
This particular box peaked at 2500 ports (~ 5000 sessions).
2012-10-02 14:09:11.734783 [CRIT] switch_core_session.c:2050 Over Session Limit! 31987
2012-10-02 14:09:11.734783 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 14:09:11.734783 [CRIT] switch_core_session.c:2050 Over Session Limit! 31987
2012-10-02 14:09:11.734783 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 14:09:11.734783 [CRIT] switch_core_session.c:2050 Over Session Limit! 31987
2012-10-02 14:09:11.734783 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 14:09:11.734783 [CRIT] switch_core_session.c:2050 Over Session Limit! 31987
2012-10-02 14:09:11.734783 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 14:09:11.734783 [CRIT] switch_core_session.c:2050 Over Session Limit! 31987
2012-10-02 14:09:11.734783 [CRIT] mod_sofia.c:4543 Error Creating Session
I'm running another "make current" right now in case I missed any changes since my update
above.
Comment by Umberto Mautone [ 02/Oct/12 ]
Same problem with git "10657f3f5c1fb00d6b8e85d3956c170695b169bd". What I don't
understand is where it's getting 32k sessions from. There are just under 1500 calls right now on
the switch. The entries below are being poured into the log file as I'm typing this at a very high
rate yet the number of calls on the switch isn't even in the same ballpark.
2012-10-02 15:48:46.020353 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.020353 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.020353 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.020353 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.020353 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.020353 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.040295 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.040295 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.060303 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.060303 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.060303 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.060303 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.060303 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.060303 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.080347 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.080347 [CRIT] mod_sofia.c:4543 Error Creating Session
2012-10-02 15:48:46.080347 [CRIT] switch_core_session.c:2050 Over Session Limit! 31860
2012-10-02 15:48:46.080347 [CRIT] mod_sofia.c:4543 Error Creating Session
Comment by Peter Olsson [ 03/Oct/12 ]
Maybe you have someone hacking/DDoS'ing your server, and sending lots of INVITEs to you?
I would start by grep'ing the network for INVITE, and see if everything really looks normal.
Comment by Umberto Mautone [ 03/Oct/12 ]
There's no DDoS or hacking going on.
Comment by Peter Olsson [ 04/Oct/12 ]
Either there are too many sessions being created, or sessions are not destroyed correctly, and
linger inside FS.
I can't believe that FS all of a sudden can't count.. :)
What did you do when this occurred the first time? That's what you need to figure out; that's
probably the reason and solution for this problem.
Did you actually check your traffic for DDoS etc?
Comment by Anthony Minessale II [ 04/Oct/12 ]
I say update one more time. If you still get it, you will have to run FS with the -np arg to disable
priority threads because you must be pushing the box more than it can take.....
Comment by Umberto Mautone [ 04/Oct/12 ]
I'll update right now.
The box is really not being pushed very hard. I'm attaching two graphs showing both port usage
and CPS. The boxes are pretty beefy and I've disabled all but the critical log level. I've taken
these boxes as high as 10k ports @ around 500 cps in the past without seeing this issue. Today's
traffic was pretty slow on this box with ports reaching just under 2700 and cps slightly over 200.
Around 11am (EDT) today, it started spitting out thread failures:
=================
2012-10-03 11:00:26.061192 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 19654 thus, saving the switch from certain doom.
2012-10-03 11:00:26.061192 [CRIT] switch_core_session.c:1560 Thread Failure!
2012-10-03 11:00:26.061192 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-03 11:00:26.061192 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 19654 thus, saving the switch from certain doom.
=================
After filling up 16 log files with the entries above between 11:00:05 and 11:00:27 (for 22
seconds straight), it finally stopped and settled on resetting the session limit at 19047. After that,
it would sporadically throw a session limit error every 20 minutes or so.
=================
2012-10-04 16:23:14.520930 [CRIT] switch_core_session.c:2050 Over Session Limit! 19047
2012-10-04 16:51:47.240655 [CRIT] switch_core_session.c:2050 Over Session Limit! 19047
2012-10-04 17:02:12.941024 [CRIT] switch_core_session.c:2050 Over Session Limit! 19047
2012-10-04 17:04:38.781032 [CRIT] switch_core_session.c:2050 Over Session Limit! 19047
2012-10-04 17:11:03.220816 [CRIT] switch_core_session.c:2050 Over Session Limit! 19047
=================
As you can see from the graphs, there was no sudden flood.
Comment by Peter Olsson [ 05/Oct/12 ]
And as you can see from the logs you had 19046 sessions active :)
How are your graphs generated? What do the fs_cli commands status and show channels say at
this time? How many threads had been created by FS at this time?
I never got any reply about what changes you made when this occurred for the first time - that is the
key to the solution I would say...
Comment by Umberto Mautone [ 05/Oct/12 ]
The sqlite3 scoreboard is disabled, which was a major bottleneck in performance and would
cause major PDD in high CPS environments. So, "show channels" gives nothing back.
The dialplan sends everything into my custom module, which is where the hooks are to count
calls and CPS. No changes have been made to my module since June other than adding
"no_throttle_limits=true" as Anthony suggested above on the 1st of October.
The problem appeared after a more recent "make current", though I couldn't tell you when
because I only noticed it when trying to figure out why so many outbound calls suddenly started
failing, which turned out to be the problem above and Anthony's "no_throttle_limits=true"
suggestion took care of that part.
The system logs are set to only log critical errors (for performance reasons) yet the log directory
was full of these logs every day. I couldn't even use the logs as an indicator to when the problem
started because they're purged daily.
If you look near the top of this thread, you'll see that, before Anthony's changes, the system
sometimes chopped the session limit down to the 300 to 3000 range, sometimes within a few
minutes of restarting. The TOP output above, taken when the box reset sessions to 3372, shows it
wasn't being taxed: CPU was at about 50% and the FS process at 2.3g. After Anthony's changes and
his suggestion to add "no_throttle_limits=true", it seems to be occurring at high numbers.
I agree that perhaps something is not releasing sessions. Could it be that new sessions are being
created faster than old sessions are being destroyed? I increase my module's channel counter
upon entering SWITCH_STANDARD_APP and decrease the channel counter when I'm exiting
my module and returning control to FS.
Comment by Peter Olsson [ 05/Oct/12 ]
Another possibility is that you are flooded with "bad" INVITEs, which don't even make it to
your module, but make it far enough to allocate the session internally in FS.
I still suggest checking the traffic on the interface, and looking for a strange number of INVITE
packets coming in.
/Peter
Comment by Umberto Mautone [ 05/Oct/12 ]
I checked that yesterday when you first suggested it using wireshark. I didn't see any bad
INVITEs or INVITES coming from unauthorized IPs. I also have a routine that logs attempts
from unauthorized IPs.
I quickly glanced at the whole mechanism a while back and, as I recall, a malformed INVITE
wouldn't get past SOFIA and, instead, would get rejected before creating a session. If it does get
through SOFIA, it creates a session and would fall into the dialplan, which calls my app.
Just for clarity, here's the one and only condition in the dialplan:
<condition field="network_addr" expression="^">
Comment by Anthony Minessale II [ 05/Oct/12 ]
I suggested using -np in case recent changes to the thread priority code do not work with the
CentOS version you are on.
Otherwise, try updating your CentOS to 5.6 final which has some kernel improvements.
If you are really brave, give 6.3 or Debian 6 a try to compare results. There have been some
very powerful improvements with more modern kernels regarding FS.
Comment by Umberto Mautone [ 05/Oct/12 ]
I updated last night after your post. Will advise after peak traffic hits the box. If it fails, I'll add
the "-np" option.
I've also upgraded the boxes to the latest 5.x packages & kernel.
My fear of moving up to 6.x has to do with FS-4291 and Red Hat's kernel code "improvements"
that cause some programs to eat CPU. I haven't tried 6.3 and can't afford to put a box into
production on 6.x until you guys give it your blessing.
Comment by Umberto Mautone [ 05/Oct/12 ]
Speak of the devil, it just happened again. Logs dumped 30 megs in 3 seconds:
================================
2012-10-05 11:09:13.999925 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 29443 thus, saving the switch from certain doom.
2012-10-05 11:09:13.999925 [CRIT] switch_core_session.c:1560 Thread Failure!
2012-10-05 11:09:13.999925 [CRIT] switch_core_session.c:1519 LUKE: I'm hit, but not bad.
2012-10-05 11:09:13.999925 [CRIT] switch_core_session.c:1520 LUKE'S VOICE: Artoo, see
what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a
twist of his mechanical arm,
Artoo reduces the max sessions to 29443 thus, saving the switch from certain doom.
================================
Calls peaked at about 2200 channels (4400 sessions) and CPS at 150.
I'll restart FS with the "-np" option and report back.
Comment by Anthony Minessale II [ 05/Oct/12 ]
We have preliminary reports that 6.3 seems to not have problems. We are for sure confident on
deb6 as many people with cent6.0 problems fixed their problems by migrating to debian. I am
sure in the end they will all work the same since they are very close to the same version.
I highly recommend that you at least prototype one box on a newer distro to take advantage of
the timing and thread performance improvements, especially at the level of abuse you put on
machines.
Comment by Anthony Minessale II [ 05/Oct/12 ]
Wait a minute?
Artoo does not lie....
"Artoo reduces the max sessions to 29443"
That means you had at least 30k sessions up at once. You are poking at the maximum number of
threads you can have open on your box without significant sysctl tweaks.
You must have something going on in this box that has started an insane number of sessions.
This is the function we run when the thread_create fails. Note that it gets the current number of
running sessions and, if that is above 110, reduces the max to that number minus 10:
static void thread_launch_failure(void)
{
    uint32_t sess_count;

    switch_mutex_lock(session_manager.mutex);
    sess_count = switch_core_session_count();
    if (sess_count > 110) {
        switch_core_session_limit(sess_count - 10);
        switch_log_printf(SWITCH_CHANNEL_LOG, SWITCH_LOG_CRIT, "LUKE: I'm hit, but not bad.\n");
        switch_log_printf(SWITCH_CHANNEL_LOG, SWITCH_LOG_CRIT,
                          "LUKE'S VOICE: Artoo, see what you can do with it. Hang on back there....\n"
                          "Green laserfire moves past the beeping little robot as his head turns. "
                          "After a few beeps and a twist of his mechanical arm,\n"
                          "Artoo reduces the max sessions to %d thus, saving the switch from certain doom.\n",
                          sess_count - 10);
    }
    switch_mutex_unlock(session_manager.mutex);
}
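To make the arithmetic concrete: the quoted function subtracts 10 from the running session count whenever that count is above 110. The sketch below is only a Python model of that rule, not FreeSWITCH code, and both function names are hypothetical. It shows how a logged limit maps back to the session count at the moment of the failure.

```python
def limit_after_thread_failure(sess_count):
    """Mirror the arithmetic in thread_launch_failure() as quoted above.

    Returns the new session limit, or None when the count is too low
    for the function to change anything.
    """
    if sess_count > 110:
        return sess_count - 10
    return None


def sessions_implied_by(new_limit):
    """Invert the rule: the session count at the moment of the failure."""
    return new_limit + 10


print(limit_after_thread_failure(29453))  # 29443
print(sessions_implied_by(29443))         # 29453
print(limit_after_thread_failure(100))    # None
```

Under this rule, a log line reporting a new limit of 29443 implies that switch_core_session_count() returned 29453 when the thread launch failed.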
Comment by Umberto Mautone [ 05/Oct/12 ]
You have a copy of the code to my module. It doesn't do anything with threads or sessions other
than placing an outbound call and the call counter increments upon entering my module and
decrements upon exiting. I used tcpdump to scan for unauthorized calls hitting the box in case
Peter was correct in that sessions are launched with bad INVITES but nothing there, either.
The box is dedicated to FS. It runs an instance of FS and a PERL script that forks up to 6
processes of the script to stream CDR to the back end. It doesn't do anything else.
The way it's set, the only things I can see creating sessions are inbound calls and outbound calls,
which, together, form a call channel. Unless I'm missing something, I can't see what else can
create sessions and not hit my module with the catch-all dialplan I set up. All I can think of is
that sessions are being created faster than they're being destroyed as a possibility after an
outbound call fails or I exit my module's application.
Comment by Anthony Minessale II [ 05/Oct/12 ]
You *are* missing something for sure. Did you read my last post?
FS does not imagine how many sessions it has open, the code that counts sessions has a
dedicated global mutex.
Try the status command periodically to see how many sessions you have.
Replay one of the bad INVITES to a lab server with sipp or something.
Comment by Peter Olsson [ 05/Oct/12 ]
Yes, that's what I was trying to explain - something is really strange for sure, and something
allocates lots of sessions.
I think the start of this issue had some strange things inside FS (when some new priority tweaks
were implemented), but after that there must be something outside FS causing this load, and
forcing FS to reduce the max sessions.
/Peter
Comment by Anthony Minessale II [ 05/Oct/12 ]
Apply this patch and at least log at CRIT level either in console or freeswitch.log
This will log every session creation and which line of code called for it.
Comment by Umberto Mautone [ 05/Oct/12 ]
Here's an excerpt of the log. I grep'd the log files to exclude "sofia" and it looks like there's
nothing creating sessions other than sofia.c and mod_sofia.c. Session limit was reduced to 6089
this time.
2012-10-05 13:46:14.016875 [CRIT] mod_sofia.c:4542 CREATE SESSION 282255
2012-10-05 13:46:14.016875 [CRIT] mod_sofia.c:4542 CREATE SESSION 282256
2012-10-05 13:46:14.016875 [CRIT] sofia.c:1862 CREATE SESSION 282257
2012-10-05 13:46:14.016875 [CRIT] sofia.c:1862 CREATE SESSION 282258
2012-10-05 13:46:14.036885 [CRIT] sofia.c:1862 CREATE SESSION 282259
2012-10-05 13:46:14.056953 [CRIT] sofia.c:1862 CREATE SESSION 282260
2012-10-05 13:46:14.056953 [CRIT] sofia.c:1862 CREATE SESSION 282261
2012-10-05 13:46:14.056953 [CRIT] mod_sofia.c:4542 CREATE SESSION 282262
2012-10-05 13:46:14.056953 [CRIT] mod_sofia.c:4542 CREATE SESSION 282263
2012-10-05 13:46:14.077288 [CRIT] sofia.c:1862 CREATE SESSION 282264
2012-10-05 13:46:14.077288 [CRIT] sofia.c:1862 CREATE SESSION 282265
2012-10-05 13:46:14.097956 [CRIT] sofia.c:1862 CREATE SESSION 282266
2012-10-05 13:46:14.097956 [CRIT] mod_sofia.c:4542 CREATE SESSION 282267
2012-10-05 13:46:14.097956 [CRIT] switch_core_session.c:2050 Over Session Limit! 6089
2012-10-05 13:46:14.097956 [CRIT] mod_sofia.c:4542 CREATE SESSION 282268
2012-10-05 13:46:14.097956 [CRIT] mod_sofia.c:4542 CREATE SESSION 282269
2012-10-05 13:46:14.097956 [CRIT] mod_sofia.c:4542 CREATE SESSION 282270
2012-10-05 13:46:14.116963 [CRIT] mod_sofia.c:4542 CREATE SESSION 282271
2012-10-05 13:46:14.116963 [CRIT] mod_sofia.c:4542 CREATE SESSION 282272
2012-10-05 13:46:14.116963 [CRIT] mod_sofia.c:4542 CREATE SESSION 282273
2012-10-05 13:46:14.156974 [CRIT] mod_sofia.c:4542 CREATE SESSION 282274
2012-10-05 13:46:14.156974 [CRIT] mod_sofia.c:4542 CREATE SESSION 282275
2012-10-05 13:46:14.196956 [CRIT] mod_sofia.c:4542 CREATE SESSION 282276
2012-10-05 13:46:14.196956 [CRIT] mod_sofia.c:4542 CREATE SESSION 282277
2012-10-05 13:46:14.236966 [CRIT] mod_sofia.c:4542 CREATE SESSION 282278
2012-10-05 13:46:14.276991 [CRIT] mod_sofia.c:4542 CREATE SESSION 282279
2012-10-05 13:46:14.296959 [CRIT] mod_sofia.c:4542 CREATE SESSION 282280
2012-10-05 13:46:14.316946 [CRIT] mod_sofia.c:4542 CREATE SESSION 282281
2012-10-05 13:46:14.376955 [CRIT] mod_sofia.c:4542 CREATE SESSION 282282
2012-10-05 13:46:14.376955 [CRIT] mod_sofia.c:4542 CREATE SESSION 282283
2012-10-05 13:46:14.396954 [CRIT] mod_sofia.c:4542 CREATE SESSION 282284
2012-10-05 13:46:14.396954 [CRIT] mod_sofia.c:4542 CREATE SESSION 282285
2012-10-05 13:46:14.416961 [CRIT] mod_sofia.c:4542 CREATE SESSION 282286
2012-10-05 13:46:14.416961 [CRIT] mod_sofia.c:4542 CREATE SESSION 282287
2012-10-05 13:46:14.416961 [CRIT] mod_sofia.c:4542 CREATE SESSION 282288
2012-10-05 13:46:14.496940 [CRIT] mod_sofia.c:4542 CREATE SESSION 282289
2012-10-05 13:46:14.516967 [CRIT] mod_sofia.c:4542 CREATE SESSION 282290
2012-10-05 13:46:14.516967 [CRIT] mod_sofia.c:4542 CREATE SESSION 282291
2012-10-05 13:46:14.516967 [CRIT] mod_sofia.c:4542 CREATE SESSION 282292
2012-10-05 13:46:14.536951 [CRIT] mod_sofia.c:4542 CREATE SESSION 282293
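With this much output, a per-call-site tally is easier to read than the raw lines. The sketch below counts CREATE SESSION entries by the source file and line that requested them. It is a hypothetical helper that assumes the line format shown in the excerpt above; `tally_call_sites` is not a FreeSWITCH function.

```python
import re
from collections import Counter

# "file.c:line CREATE SESSION n", as printed by the diagnostic patch.
CREATE = re.compile(r"\[CRIT\] (\S+:\d+) CREATE SESSION (\d+)")


def tally_call_sites(lines):
    """Count session creations per call site (source file and line)."""
    counts = Counter()
    for line in lines:
        m = CREATE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


sample = [
    "2012-10-05 13:46:14.016875 [CRIT] mod_sofia.c:4542 CREATE SESSION 282255",
    "2012-10-05 13:46:14.016875 [CRIT] sofia.c:1862 CREATE SESSION 282257",
    "2012-10-05 13:46:14.036885 [CRIT] sofia.c:1862 CREATE SESSION 282259",
    "2012-10-05 13:46:14.097956 [CRIT] switch_core_session.c:2050 Over Session Limit! 6089",
]
for site, n in tally_call_sites(sample).most_common():
    print(site, n)
```

Any call site other than the two sofia ones would stand out immediately in such a tally.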
The box wasn't very taxed, either, and FS is only at about 4.8g.
top - 17:57:55 up 91 days, 7:49, 1 user, load average: 137.86, 150.78, 149.20
Tasks: 243 total, 2 running, 241 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.1%us, 7.1%sy, 14.3%ni, 64.3%id, 0.0%wa, 0.0%hi, 7.1%si, 0.0%st
Mem: 49447640k total, 35697232k used, 13750408k free, 592332k buffers
Swap: 51511288k total, 192k used, 51511096k free, 27257856k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9346 root 34 19 8562m 4.8g 6368 S 177.9 10.3 507:16.55 freeswitch
1 root 15 0 10364 692 588 S 0.0 0.0 116:31.57 init
2 root RT -5 0 0 0 S 0.0 0.0 0:25.76 migration/0
Comment by Anthony Minessale II [ 05/Oct/12 ]
You are joking with this "only 4 gigs of RAM" and treating 6000 calls like a small number,
right?
I didn't expect you would find something other than sofia but you can see a trail of all the
sessions created.
When you get this error, it means you do not have any memory left for creating threads.
You can try this:
sysctl -w kernel.pid_max=131072
sysctl -w vm.max_map_count=131072
Or try a newer OS.
You never told me about the results with -np.
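Before changing the two sysctl settings suggested above, it can help to record their current values. The following is a minimal sketch, assuming a Linux host that exposes them under /proc/sys; the helper name read_limits and the LIMITS tuple are illustrative and not part of FreeSWITCH.

```python
from pathlib import Path

# The two kernel settings named in the comment above, as procfs paths.
# Assumption: standard Linux layout under /proc/sys.
LIMITS = ("kernel/pid_max", "vm/max_map_count")


def read_limits(root="/proc/sys"):
    """Return {name: int or None}; None means the value could not be read."""
    values = {}
    for name in LIMITS:
        try:
            values[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_limits().items():
        # Print in sysctl-style dotted notation, e.g. kernel.pid_max
        print(f"{name.replace('/', '.')} = {value}")
```

Comparing the output before and after running the sysctl -w commands confirms whether the new values took effect. Values set with sysctl -w normally do not persist across a reboot.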
I begin to feel taken advantage of with this kind of call volume and the last several times I have
helped you for free. The least you could do is consider a support contract but with your
expectations you may be "uninsurable"
Comment by Umberto Mautone [ 05/Oct/12 ]
I said the FS process is at 4 GB. The box has 48 GB of RAM (Mem: 49447640k)
The box was running 1200 calls at the time of the logs above, not 6000 calls.
I've taken the box up to about 10k channels in the past at around 350 CPS and it ran fine. You
and I have discussed via email what I've managed to do with the hardware when I saw issues
with CentOS 6.2. Keep in mind this is running with bypass media (i.e. no RTP goes through the
box).
The reason I opened this as a bug is not to get you to enhance the product for free or to fix my
config issues. You know me better than that. When I need your help for a non-bug issue, or
even an urgent bug issue, I always contact you offline to make arrangements. I don't use this
place as a support forum.
If you're telling me this is not a bug and that it's just hitting design limits, then I'll stop here
and work with you offline to figure out a viable solution, as I've always done with you in the past.
That being said, the box didn't exceed 3k simultaneous calls today yet the log said it had to drop
sessions down to 29k then later on it had to drop them to 6k. So, if it's running out of resources,
I'm puzzled as to why it had enough to hit 29k sessions then suddenly didn't have enough to go
beyond 6k sessions, even though, all the while, it never exceeded 3000 simultaneous calls all
day.
Comment by Peter Olsson [ 06/Oct/12 ]
Umberto, you need to dig deeper into what happens on your box. According to the log here, you did
have 6000 sessions. You really need to trust the number of sessions reported by FS. If you had
run "status" in fs_cli at that time, you would have seen that it had 6000 sessions active. I
suggest you also try to log the total number of sessions when a session is created, and do the
same every time a session is destroyed.
/Peter
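The suggestion above, logging the running session total on every create and destroy, can be sketched as follows. This is an illustration of the bookkeeping only, not FreeSWITCH code (FreeSWITCH is written in C); the class SessionCounter, its method names, and the log line format are all hypothetical.

```python
import threading


class SessionCounter:
    """Hypothetical helper: tracks live sessions and builds one log line per event."""

    def __init__(self):
        self._lock = threading.Lock()  # sessions start and end on many threads
        self._active = 0
        self.peak = 0

    def created(self, session_id):
        """Count a new session and return a log line with the running total."""
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
            return f"CREATE SESSION {session_id} active={self._active}"

    def destroyed(self, session_id):
        """Count a finished session and return a log line with the running total."""
        with self._lock:
            self._active -= 1
            return f"DESTROY SESSION {session_id} active={self._active}"
```

With both events logged, a total that keeps climbing while the call count stays flat would point at sessions that are created but never destroyed.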
Comment by Umberto Mautone [ 06/Oct/12 ]
Guys, I want to sincerely apologize. I have been using two of my production boxes to run these
updates and patches but I just realized I ran "make current" on only one of them and I've been
debugging the wrong box after the fix.
The box that was correctly updated with Anthony's fix was busier, running about 5000 ports @
300 cps, and its logs are empty.
Anthony's fix did the trick.
I'll keep this open until peak traffic resumes Monday just in case.
Comment by Peter Olsson [ 15/Oct/12 ]
Umberto, does everything look ok now? Please close this ticket if you're satisfied.
/Peter
Comment by Jeff Lenk [ 16/Oct/12 ]
Closing
Comment by Umberto Mautone [ 16/Oct/12 ]
Issue has been fixed.
Comment by Subhash Kumar [ 17/Apr/13 ]
Hi,
Can you please let me know which fix you applied? We are facing the same problem.
Thanks,
Subhash.
Comment by Subhash Kumar [ 18/Apr/13 ]
Mautone,
We also faced the same issue on our testing box, so we would like to know how you solved it.
Can you please help us with this issue?
Generated at Wed Feb 10 01:23:10 EST 2016 using JIRA 6.4.10#64025sha1:5b8b74079161cd76a20ab66dda52747ee6701bd6.