| Error Messages |
Users on the email lists often run into the same error messages. The solutions are usually quite simple, but the messages can initially be baffling. This chapter lists the more common problems and their solutions.
Various components can be run at the command line instead of as a cron job for debugging purposes. It is important to ensure that you run the job as the correct user (by using su or sudo) and that you are in the right directory, which should be the jffnms/engine directory.
The most common component to run on the command line is the poller. This is because most problems stem from something not updating, such as an RRD file filled with NaN or the status of an interface not changing. The poller command is run with the options:
The IDs are not things like IP addresses, but the internal numbers that JFFNMS assigns. You can see the ID for each interface and host by viewing the Host Table or Interface Table and looking in the first column after the actions.
Any parameter that is 0 or is not specified means all items of that particular type. For example, if you want to poll all interfaces for a host, you would specify only that host's ID and leave the other fields blank. To poll the host that has ID 42 but only its TCP ports (interface type 2), the command line would be:
A typical poller output will look like the following:
18:46:02 : H 2 : Poller Start : 43 Items.
18:46:02 : H 2 : I 2 : P 10 : snmp_counter:storage_block_size( .1.3..4.2): 1024 -> buffer(: 1 (time P: 19.17 | B: 0.16))
[.. multiple poller item lines ..]
18:46:04 : H 2 : I 2 : P 60 : no_poller(: 0 -> rrd(*): storage_block_size:1024 - storage_block_count:386772 - storage_used_blocks :255424 (time P: 0.1 | B: 24.15))
Each line starts with the local time. The first line for each host tells you what is the host ID and how many interfaces there are. In the example we are polling host id 2 "H 2" and there are 43 interfaces.
A poller group is a series of poller item -> backend pairs that are arranged in priority. Each pair is displayed. The second line shows the following information.
The third line in the example, after the break, shows a typical response for a poller/backend pair for putting the data into the RRD files. The first fields are like for any poller line. The important thing to notice is that each field has a value and that the value makes sense for whatever you are counting.
Also notice that the storage_block_size value is the same value as was given in the snmp_counter poller. JFFNMS poller groups work by the poller item that fetches the value putting it into the buffer, and then the RRD backend yanking the value out to put into the RRD file.
Notice the rrd has the asterisk (or star) in the brackets. This means we have used the RRDTool All DS backend. A common mistake is to use the RRD Individual Value poller here. That is only good if you are directly taking a poller value and putting it into an RRD file.
The consolidator takes one set of information and transforms it into another. Examples include events to alarms and syslog messages to events. The consolidator can run with the following parameters:
The times parameter specifies the number of times the consolidators are run, with the default being 3 times to cover any things that need to be consolidated a few times.
The Interface Discovery component polls each host that has autodiscovery turned on for each interface type. Depending on the autodiscovery setting, it will then: do nothing; add the interface; generate an event saying it found the interface.
This command is often used when you are trying to debug a new Discovery Plugin. In that case you would just enter the host's ID and new interface type ID.
There are 4 parameters that you can use to launch the interface discovery program:
Now, let's say you have created a new interface type and it has been given the ID of 10001. You have already put your target host in JFFNMS which said it was host number 42. To check your discovery plugin works, with no alarms raised for finding interfaces, run the command:
A common problem with this sort of testing is to test a host where you have autodiscovery turned off. You don't get very much useful information.
There is nothing at all going on here. Go back to the Hosts table in the GUI and enable autodiscovery.
You should then get some more information, like the following:
It's found a new interface of type 10001, but nothing was done. Most likely the autodiscovery type is notify or at least not automagic. The important thing is it has found your new interface type if you get these messages.
Every 30 minutes the RRD Analizer is run, checking that certain values are within pre-defined limits. If you are attempting to send an email when some value goes too high, running the RRD Analizer is your first step.
Just like the poller, the RRD Analizer will print out a lot of useful information.
Each line begins with the server time and the interface number. In this example, we have started the analizer just after 8pm and it is showing the SLA applied to interface 90 (I90).
The first line is just a separator, so you know this group of evaluations is for this interface only. Line two displays the time range (from 19:40:00 to 20:05:00) that this analysis covers and shows that there are 5 matching measurements. As each poll is done 5 minutes apart and the analysis is for 30 minutes, it should always be 5.
Line 3 shows all the RRD values we have for this interface with the actual average value in brackets. These are the measurements that we will be using. Line 4 is another separator with hyphens, and so it is onto the RRD evaluation.
Before we get to line 5, it will make a lot more sense if we look at the actual SLA group we have applied to Interface 90.
Interface 90 is a "Linux System Info" interface type. The SLA is the "Linux/Unix CPU" SLA. Its definition is in the following table:
| Position | Condition |
| 10 | Load Average > 5 |
| 20 | CPU Utilization > 80% |
| 30 | OR |
A simple enough definition. It triggers if the Load Average is above 5 or the CPU Utilization is above 80% (or both).
Now, re-list each SLA line:
Condition 0 is the first line of the SLA: is Load Average higher than 5? The Individual Definition for this SLA is ( <load_average_5> > 5 ). The analizer has plugged in the value for the 5 minute load average (see line 3 above and the load_average_5 value) into the equation and correctly worked out that 1 is not greater than 5, so the line is FALSE.
Next is CPU Utilization > 80%. The problem here is that there is no direct value for this, so we need to calculate it. Again looking at the individual conditions we get ( cpu_user_ticks + cpu_nice_ticks + cpu_system_ticks ) * 100 / ( cpu_user_ticks + cpu_idle_ticks + cpu_nice_ticks + cpu_system_ticks ) > 80
Plugging in the numbers given in line 5, JFFNMS has calculated that the utilization is 30.85%. This is not over 80%, so again the result is false.
Position 30 in the ruleset was the OR. The above line ORs 0 (false) with 0, giving 0.
Finally we do our final evaluation. As we know from the previous line, the result is False, so we do not raise an alarm. This logic has worked out correctly because the server's 5 minute load average was not above 5, nor was the CPU Utilization over 80%.
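The walk-through above can be sketched in a few lines of Python. This is only an illustration of the logic of the "Linux/Unix CPU" SLA, not the analizer's own code; the function names and the tick values in the usage note are made up for the example.

```python
def cpu_utilization(user_ticks, nice_ticks, system_ticks, idle_ticks):
    """Return the percentage of ticks spent busy (user + nice + system)."""
    busy = user_ticks + nice_ticks + system_ticks
    total = busy + idle_ticks
    if total == 0:
        # No ticks recorded at all; treat as 0% rather than divide by zero.
        return 0.0
    return busy * 100.0 / total


def evaluate_cpu_sla(load_average_5, user_ticks, nice_ticks, system_ticks, idle_ticks):
    """Evaluate the three SLA positions in order and return the final result."""
    # Position 10: Load Average > 5
    load_too_high = load_average_5 > 5
    # Position 20: CPU Utilization > 80%
    cpu_too_high = cpu_utilization(user_ticks, nice_ticks, system_ticks, idle_ticks) > 80
    # Position 30: OR the two previous results together
    return load_too_high or cpu_too_high
```

With a load average of 1 and a server that is busy for roughly a third of its ticks, for example evaluate_cpu_sla(1, 25, 0, 5, 70), both conditions are false and so is the OR, which is the same outcome as the example output.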
The first thing to check with the database is to go to the Global Setup page and make sure it says the database is OK. Database problems may not be caused by JFFNMS; they are more likely to be a problem with PHP or with how you have set up the database. This section is not a tutorial about how to fix a database; there are better locations for that.
Both types of databases have their own set of problems and fixes that you probably want to check first. For MySQL, check the Problems and Common Errors appendix on the MySQL website. If you are running PostgreSQL, then the FAQ for PostgreSQL is a good source of information for that database type.
By default JFFNMS suppresses the database error messages. If you are having problems with the database, you need to re-enable these messages. The file lib/api.db.inc.php has a function called db_open that then calls either mysql_connect or pg_connect. Both these functions have an at (@) symbol in front of them, which means suppress errors. Temporarily remove this symbol and reload the page to obtain the error.
Most errors with the database start with the lines "db_ping(mysql) Connection to DB Restored...". This is just JFFNMS trying to get to the database; the real error will be after these lines.
If you want to test it on the command line, make sure you put in the host (-h) parameter, as JFFNMS currently cannot use the Unix socket. The command would be
A very popular email on the JFFNMS list starts off with something like
Almost always this is due to FC4's botched implementation of SELinux, and you either need to fix it (if you understand how) or disable it. For more information about Fedora and SELinux, visit the Fedora Core 3 SELinux FAQ.
This is a common error for first-time users, or after you upgrade to a newer version of JFFNMS. The problem is that you have either not loaded the SQL tables into the database or you have not run the new SQL commands to update the database structure for the new version.
MySQL uses files that end with the suffix .MYI to store the indexes for a table of the same name. For example, the indexes for the table 'events' are stored in a file called 'events.MYI'. Any error message that mentions these MYI files means the database itself is in trouble, usually some sort of corruption.
The MySQL website has a section on Corrupted MyISAM Tables. The short answer is if you see this problem run the command "REPAIR TABLE tablename" on the MySQL client. The website has links to the REPAIR TABLE and CHECK TABLE syntax.
Some problems are not really JFFNMS' fault but are due to something not being set up correctly in PHP. However, how these problems appear depends on how JFFNMS uses PHP. The PHP website has some good information about PHP Installation and Configuration, and also some Frequently Asked Questions in their PHP Installation FAQ.
PHP is used by JFFNMS in both "Apache module" and "cgi" modes. Generally speaking, if you are trying to do something via a web page, it is running as an Apache module, and everything else is via cgi.
The way that PHP is used changes the configuration file location. The following table shows the locations of the main PHP configuration file, php.ini for the various types of systems:
| System | CGI directory | Apache directory |
| Debian | /etc/php/cgi/ | /etc/php/apache/ |
The Apache PHP module needs to have certain modules loaded. You can see which ones are active with the System Setup administration menu item.
If you believe that a module is loaded but it is not shown on this page, you have to make sure the PHP ini file tells PHP to load the module with an extension= line and that Apache has been restarted.
If the System Setup screen doesn't show the modules are loaded then there is no point continuing until that is fixed as a lot of JFFNMS will not work correctly. The exception to this is only one of the database (MySQL or PostgreSQL) modules needs to be loaded and WDDX is not essential.
Just like the Apache PHP module needs the modules, the command line needs the same ones. You can find what modules will be loaded with the "php -m" command.
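If you check this often, the comparison can be scripted. The following is a rough Python sketch that takes the text printed by "php -m" and reports which modules from a list are missing. The function name and the module names in the usage note are only examples; use whatever modules your System Setup screen says are needed.

```python
def missing_php_modules(php_m_output, required):
    """Return the required module names that do not appear in `php -m` output."""
    loaded = set()
    for line in php_m_output.splitlines():
        name = line.strip()
        # Skip blank lines and section headings such as "[PHP Modules]".
        if not name or name.startswith("["):
            continue
        loaded.add(name.lower())
    return sorted(m for m in required if m.lower() not in loaded)
```

You would capture the output of "php -m" (run as the same user and with the same PHP binary that the cron jobs use) and pass it in as a string, for example missing_php_modules(output, ["mysql", "snmp"]). An empty list means everything you asked about is loaded.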
If you load up the JFFNMS screen and only get PHP source code and not a properly rendered web page then that means that the webserver is not interpreting PHP but just displaying the raw pages. There can be many reasons for this happening, including:
This error message means PHP doesn't know anything about the function mysql_connect(). The problem here is that this function is part of the PHP mysql module and that module is not loaded.
To load the module, you need to make sure the following line is in your php.ini configuration file:
extension=mysql.so
This problem is best described as follows: you type some new value in a text box, you then click submit, and the old value appears. No matter what box you try, you always get the old value.
JFFNMS needs to have register globals (the register_globals setting in php.ini) turned on. The problem is that newer versions of PHP have this off by default, which is different to the old default.
The most common time you will see this is when you have just installed JFFNMS and you are trying to update the JFFNMS setup for the first time.
The first thing to do when you suspect there is a problem with the poller is to run it on the command line; see Running poller on command line for information on how to do this.
On new installations, if you have hosts configured and interfaces set up for those hosts, running the poller on the command line should give you some debugging output. However, if you get no output at all, the most likely cause is that the PHP module for the database you are using is not being loaded for the cgi/cli PHP version. See Modules for cli PHP for details.
The section on Running poller on command line describes the output of the poller. If you are getting nothing between the : and the -> of the poller lines then it means the poller itself is returning nothing.
For SNMP-based pollers, this could mean either the SNMP module is not loaded or the snmpget command is not found if you are using some higher versions of SNMP.
There are only so many things that can go wrong with the TFTP transfer of the host's configuration. The first thing to know is that this only works with certain Cisco devices. Assuming you have the correct device, check the following:
You can manually run the TFTP configuration collector, but make sure you do it as the right user id, by running php4 -q tftp_get_host_config.php . You need to be in the engine directory and the PHP binary may be called php and not php4.
When you run the script on the command line you may get an error like
<b>Warning</b>: Error in packet. Reason: (noSuchName) There is no such variable name in this MIB. in <b>/usr/share/jffnms/engine/configs/cisco_catos.inc.php</b> on line <b>16</b><br /> <br /> <b>Warning</b>: This name does not exist: enterprises.9.5.1.5.1.0 in <b>/usr/share/jffnms/engine/configs/cisco_catos.inc.php</b> on line <b>16</b><br />
This means you have the wrong Config transfer mode.
The consolidator is responsible for getting the events happening from the raw tables. If your NMS is going strangely silent or you are getting strange alarms then it might be the consolidator.
You may have an event that should be raising an alarm for an interface, but for some reason this is not happening. This problem usually only happens when you are creating new interface types.
The first thing is to check that the event is actually supposed to raise an alarm. Go to the Event Types which is found in the Administration menu item Internal Configuration => Event Analyzer => Event Types. Find the event you are working on and make sure that the column "Event Generates an Alarm?" is checked.
Another common reason for no alarm is because the consolidator cannot link the event to the interface. For an event to be consolidated into an alarm, it must have a known host, a valid interface field that matches the "interface.interface" field of the interface, and a valid state field.
Also read the section on Triggers, especially the part about the debug log format. If you see a "Then email (1)" then look at your server and how it handles email. If you don't see a line ending with that, then the problem is (at least) within JFFNMS.
The email action is pretty simple: it takes a bunch of fields and uses the PHP mail function to send the email. If the trigger debug logs show a return value of 0, it can only be a few things. Those things are, in order from most to least likely:
This problem is not JFFNMS' fault but a problem with the MySQL databases. The problem is that the consolidator attempts to insert new events and cannot, due to a duplicate key.
Running the consolidator on the command line shows the problem:
jffnmshost:/usr/share/jffnms/engine# sudo -u jffnms php4 -q consolidate.php
SYSLOG Events to Process: 212857
string(67) "rsyncd[20922]: rsync to blah from fred.mynet (192.168.10.1)"
SYSLOG Message ID: 2433507 // Host: 18 // interface(: // state(: // username(: // info(: rsyncd[20922]: rsync to blah from fred.mynet (192.168.10.1)
Query Failed - table_insert(events) - insert into events (date,type,host,interface,state,username,info,referer) VALUES ('2003-08-13 00:00:02','1','18','','','','syncd[20922]: rsync to blah from fred.mynet 192.168.10.1','2433509') - Duplicate entry '2581292' for key 1
The important thing is the error message "Query Failed ... Duplicate entry (some number) for key 1". This means you are trying to add a new line to the database and the value for the primary key is already in there. The problem is that you are not specifying the primary key column ("id"), so how can you be adding a duplicate?
The underlying problem is that the MySQL table is corrupt and has forgotten what the next id number should be. It has gotten terribly confused, with one part thinking the maximum value for id is 2581292 but another part thinking it is some other, higher number.
The solution is simple: log in to MySQL and repair the table:
mysql> repair table events;
+---------------+--------+----------+----------------------------------------------+
| Table         | Op     | Msg_type | Msg_text                                     |
+---------------+--------+----------+----------------------------------------------+
| jffnms.events | repair | warning  | Number of rows changed from 337934 to 337941 |
| jffnms.events | repair | status   | OK                                           |
+---------------+--------+----------+----------------------------------------------+
2 rows in set (43.63 sec)
Then re-run the consolidator and wait for the rush of new events. I have never heard of this problem with PostgreSQL; if you do see it and are using PostgreSQL, please let us know.
First check that the files really are not being created. The JFFNMS performance screens can sometimes say there are no files for an interface when the files are there but there is a problem with the code or the file.
The files are created in the RRD Files Path configuration item, which you can see in the Setup screen (System Setup in the Administration Menu ). You can then check in that directory to see if files interface-NN-Y.rrd are there. NN will be the interface ID and the Y will be a number like 0 or 1.
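If you have a lot of interfaces, a small script can do the looking for you. This Python sketch lists the RRD files for one interface ID using the interface-NN-Y.rrd naming described above. The function name is made up for the example, and the directory you pass in is whatever your RRD Files Path setting says.

```python
from pathlib import Path


def rrd_files_for_interface(rrd_path, interface_id):
    """Return the sorted names of interface-NN-Y.rrd files for one interface ID."""
    # The trailing "-" after the ID stops interface 90 from matching interface 900.
    pattern = f"interface-{interface_id}-*.rrd"
    return sorted(p.name for p in Path(rrd_path).glob(pattern))
```

An empty list for an interface that is being polled means the files really are missing, and it is worth moving on to the log file checks. If the files are listed, the problem is more likely to be with the code or with the files themselves.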
The next place to check is the log files of the Apache server. Any child processes, such as PHP or RRDTool, that print to stderr will get their information logged here, so it is a good place to see if one of those tools is misbehaving.
An example error message is
In this case there is a missing / (var instead of /var) that is causing the problem. To fix this edit the config.ini either directly or via the System Setup menu.
No such file or directory means either the tempengine directory doesn't exist or the file could not be created.
A related error message in the apache error logs is
This means the user the web server is running under is unable to write files to that directory. Make sure that whatever user or group the webserver is running as is able to write files into that directory.
You will see this error message in the Apache error log. This problem is due to incompatible linking between RRDTool and the libpng and libgd libraries. It is also a reasonably difficult problem to fix, as it involves recompiling.
You can see the double-linking with the ldd command, such as
$ ldd /usr/bin/rrdtool
librrd.so.0 => /usr/lib/librrd.so.0 (0x000002000002a000)
libpng.so.2 => /usr/lib/libpng.so.2 (0x000002000005c000)
libgd-gif.so.1 => /usr/lib/libgd-gif.so.1 (0x00000200000ae000)
libm.so.6.1 => /lib/libm.so.6.1 (0x00000200000f6000)
libc.so.6.1 => /lib/libc.so.6.1 (0x0000020000184000)
libpng12.so.0 => /usr/lib/libpng12.so.0 (0x000002000031a000)
libz.so.1 => /usr/lib/libz.so.1 (0x000002000035a000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x0000020000000000)
libpng is linked twice in this case, and there are two different versions, namely libpng.so.2 and libpng12.so.0. A setup such as this is almost guaranteed to give some sort of problems.
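Spotting this by eye gets harder with longer ldd listings, so here is a rough Python sketch that groups the libraries in ldd output by base name and reports any that appear under more than one file name. It is a simple heuristic, not a proper tool: it strips trailing version digits from the name, so it treats libpng and libpng12 as the same library, which could also lump together unrelated libraries that differ only by a number. The function name is made up for the example.

```python
import os
from collections import defaultdict


def duplicate_libraries(ldd_output):
    """Return {base name: [file names]} for libraries linked under several names."""
    seen = defaultdict(set)
    for line in ldd_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # First field is the library name, e.g. "libpng12.so.0".
        filename = os.path.basename(fields[0])
        stem = filename.split(".so")[0]
        # Drop trailing version digits so "libpng12" groups with "libpng".
        base = stem.rstrip("0123456789.")
        seen[base].add(filename)
    return {base: sorted(names) for base, names in seen.items() if len(names) > 1}
```

Fed the ldd listing shown above, this reports libpng with both libpng.so.2 and libpng12.so.0, and nothing else.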
There are essentially three known reasons why there are gaps in the graphs.
The first is that the polled data has exceeded the maximum value for the RRD file. This is more likely to be the cause if the gaps appear around where the graph has values of 15 or 150 Mbps. You will get 0 for the times the graph exceeds the limit.
The RRD file is created using the parameters in the RRDTool Structure DEF column of the Interface Types table. If the column uses the tag <bandwidth> then the values of Bandwidth IN and Bandwidth OUT are used in the following formula:
Bandwidth = 1.5 * Max(Bandwidth IN, Bandwidth OUT)
If you change the maximum value then the RRD file is updated next poll using the rrdtune feature of RRDTool.
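As a quick check of the formula, here is the same calculation in Python. The function name is made up for the example, and the values are in whatever unit your Bandwidth IN and Bandwidth OUT fields use.

```python
def rrd_max_bandwidth(bandwidth_in, bandwidth_out, factor=1.5):
    """Return the RRD maximum: 1.5 times the larger of the two bandwidths."""
    return factor * max(bandwidth_in, bandwidth_out)
```

For an interface with Bandwidth IN and Bandwidth OUT both set to 10 Mbps, rrd_max_bandwidth(10, 10) gives 15, and for 100 Mbps it gives 150. That is why gaps caused by this limit tend to show up around the 15 or 150 Mbps marks.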
Another reason for gaps in graphs is the poller fails to run, or fails to complete. A heavily loaded server or a problem with JFFNMS and/or the database will show up as random missing samples in the graph, with values of 0.
Finally, if you have very high speed interfaces you may get some strange results. The graphs look OK until you reach a certain rate, then they show very low values before finally going back up to where they were. This will occur around the 120-150Mbps mark, where the graph will suddenly show traffic rates in the low tens of Mbps. This problem is caused by the 32 bit SNMP counters wrapping around within the 5 minute polling interval ( 2^32 bytes / (5 * 60 seconds) = about 14.3 MBps = about 115 Mbps ). There is no simple solution to this problem. Changing the polling rate, and the associated RRD definitions to reflect the new rate, will fix it. Later versions of JFFNMS will use the 64 bit counters to get around this problem, until 10,000 GigE comes along.
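To see where the wrap-around point sits for your own polling interval, you can work it out with a few lines of Python. This sketch assumes a 32 bit counter that counts bytes (octets) and wraps at 2^32; the function name is made up for the example.

```python
def wrap_rate_mbps(counter_bits=32, poll_seconds=300):
    """Return the sustained rate, in Mbps, at which the counter wraps once per poll."""
    max_bytes = 2 ** counter_bits
    bytes_per_second = max_bytes / poll_seconds
    # 8 bits per byte, 1,000,000 bits per megabit.
    return bytes_per_second * 8 / 1_000_000
```

With the default 5 minute poll this comes out at a little under 115 Mbps. Polling every minute, wrap_rate_mbps(poll_seconds=60), pushes the limit up five-fold, which is why changing the polling rate helps.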
JFFNMS Manual, last changed July 3, 2008