BUG: Video signal lost - Adding grey image
Hey
I have got a problem resulting in "Video signal lost - Adding grey image"
I have always seen an occasional entry in my log saying "Error reading image header". These have appeared a few times a day, at different times, from each camera (netcams), and have not seemed to be a problem.
In the latest release I get the "signal lost" error instead, and after that I have the grey image until I restart Motion, which is not so good from my point of view.
I have done a lot of testing and found that the daily snap of 20051224 works (it logs that single entry and carries on), but the version of 20051227 gives the persistent grey image.
I have yet to try the versions from 26 and 27 Dec., but hope that you already know where to look for the problem. The one change I can notice is the ffmpeg_filename to movie_filename rename, which I thought was quite new in this release, but my intensive testing shows that it has come and gone several times in the daily snaps over the last months.
It looks to me like something has broken, with the result that a lost or damaged image frame makes Motion throw away the camera and not even try to catch up again?
Can I do something, or can you? I have 3 TrendNet TV-IP100 netcams, which are known to work with Motion and have done so until now ...
Test case
Start Motion and wait until a single frame error shows.
Environment
Motion version: 3.2.5.1
ffmpeg version: the one known to work with 3.2.5
Shared libraries: ffmpeg, mysql
Server OS: Fedora Core 4, Kernel 2.6.15-1.1833
--
TheOtherBug - 10 Mar 2006
Follow up
Greg Swift adds these comments on March 12, 2006 ... I too am seeing an increased number of messages "Video signal lost - Adding grey image" that started when I switched to running version 3.2.5.1 . The grey image appears and never gets out of that state. Even when snapshots capture that grey image, the "live" view of the network camera is good (without grey). The camera therefore is responding and is not hung, but motion does not detect that the signal is available and continues to display the grey image for the snapshots forever until I restart motion.
This problem did not happen with this frequency when I was running 3.2.4_snap5. Today I reverted to snap5 and the problem has gone away. I think that somewhere between 3.2.4_snap5 and 3.2.5.1 a bug was introduced in which the grey image condition is never recovered from; it remains in that state until Motion is restarted (service motion restart). With the 3.2.4_snap5 code I do sometimes (rarely) get the grey image and the video signal lost message, but it eventually recovers and things return to normal.
One difference between my environment and the environment of the original bug report is that I have compiled motion using these options: ./configure --without-v4l --without-mysql --without-ffmpeg --without-pgsql and so therefore ffmpeg and mysql are not the source of the problem. I think that this is an important point (that ffmpeg and/or mysql are NOT the problem). I am however running the exact same kernel 2.6.15-1.1833_FC4. Thank you.
-- 12 Mar 2006
GregSwift
The problem still exists in Motion v.3.2.6. It cannot be the specific kernel, as I have the same problem on FC5 x86_64 with kernel 2.6.15-1.2054, and it cannot be ffmpeg, as versions up to at least snap 20051224 work all right with the same ffmpeg version.
Some changes must have been made in the netcam code between 3.2.4s5 and 3.2.5, with the result that the camera thread exits if the slightest burp is seen on the network or camera stream?
--
TheOtherBug - 23 Mar 2006
This is the exact problem that I have with both versions 3.2.5 and 3.2.6. I am running Gentoo Linux on 2.6.15 kernel. The log only says "Video Signal Lost", the thread keeps running, and the thread never recovers. I thought for awhile that I was the only one experiencing this problem...
--
MichaelHarris - 10 Apr 2006
This is a bug that concerns me too. I cannot understand what has changed since 3.2.4_snap5 that can cause this. We have always had, and still have, problems with buggy V4L drivers for USB cameras, and it is hard to do much about that except implement the watchdog feature we have in the plans. But the netcams should recover.
I do not own a real netcam but I simulate one with another computer running Motion. And it recovers fine. One of the changes since 3.2.4_snap5 is the connection string sent. It is supposed to fix something but maybe the same fix breaks something else.
Please try replacing netcam.c with the attached netcam.c from Motion 3.2.4_snap5 and report back whether it cures the problem. Save the original netcam.c somewhere else, and then recompile Motion. I have confirmed that Motion 3.2.6 compiles and runs with the older version of netcam.c.
If it cures the problem, the next step is to find out which of the changes is responsible. I suspect line 70
"Host: %s\r\n"
to be a candidate.
Another candidate is line 1547 in the 3.2.6 version of netcam.c (line 1544 in the old version):
} while ((retval < 0) && ((errno == EINTR) || (errno == EAGAIN)));
In the old version
} while ((retval == EINTR) || (retval == EAGAIN));
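To illustrate how the two conditions quoted above differ, assuming the loop wraps a recv()-style call: such a call returns -1 on failure and reports the reason through errno. The old test compares the return value itself against EINTR and EAGAIN, so after a failure it is false and the loop exits after one pass. The new test keeps looping for as long as the call fails with one of those errno values. A minimal sketch; the helper names should_retry_old and should_retry_new are made up for this example and do not appear in netcam.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Condition as quoted from the 3.2.6 netcam.c: retry only when the call
 * failed (negative return) and errno reports a transient condition.
 * errno is passed in as 'err' so the logic can be checked in isolation. */
static bool should_retry_new(ssize_t retval, int err)
{
    return (retval < 0) && ((err == EINTR) || (err == EAGAIN));
}

/* Condition as quoted from the older netcam.c: compares the return value
 * itself against the errno constants. A failed call returns -1, which is
 * neither EINTR nor EAGAIN (both are positive), so this is false after
 * a failure. */
static bool should_retry_old(ssize_t retval)
{
    return (retval == EINTR) || (retval == EAGAIN);
}
```

With a camera that has stopped sending, should_retry_new(-1, EAGAIN) stays true on every pass, while should_retry_old(-1) is false, which matches the observation that the old code "recovers" by leaving the loop.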
Try some of this and report back your results.
--
KennethLavrsen - 11 Apr 2006
Replacing netcam.c does fix the problem for me. A restart of the camera gives the following in the log:
Apr 12 05:30:21 localhost motion: [2] Error reading image header
Apr 12 05:30:31 localhost motion: [2] timeout on connect()
Apr 12 05:30:31 localhost motion: [2] re-opening camera (streaming)
I have also caught the following, and as before 3.2.5 everything carries on as if nothing had happened:
Apr 12 07:09:55 localhost motion: [1] Error reading image header
Apr 12 07:59:57 localhost motion: [3] Error reading image header
Apr 12 10:13:40 localhost motion: [1] Error reading image header
I'll now try replacing line 1544 in the new code as proposed, but what about line 70?
Edit:
Replacing line 1547 in netcam.c from v.3.2.6 is all it takes to get things working as before, but I now wonder what the parts dropped from that line do.
--
TheOtherBug - 23 Mar 2006
I had the feeling it was line 1547 because it is potentially an infinite loop.
The problem is most likely triggered by the camera having a bug and never completing sending data back to Motion. The old code is also wrong but recovers. So we need to redo that condition so it both tries to complete the reading and escapes when the camera gets stuck.
Thanks for the feedback. Please keep an eye on this topic. We will most likely have a proposed fix in a day or two.
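One way to get both properties (keep trying to complete the read, but escape when the camera is stuck) is to bound the number of retries. The following is only a sketch of that idea under stated assumptions, not the fix that went into Motion; the function name recv_with_retry_limit and the max_attempts parameter are hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: call recv() and retry on transient errors
 * (EINTR, EAGAIN), but give up after max_attempts calls so that a
 * camera which never sends anything cannot hold the loop forever.
 * Returns what the last recv() returned; on failure errno is the
 * value set by that last call. */
static ssize_t recv_with_retry_limit(int sock, void *buf, size_t len,
                                     int flags, int max_attempts)
{
    ssize_t retval;
    int attempts = 0;

    do {
        retval = recv(sock, buf, len, flags);
    } while ((retval < 0) &&
             ((errno == EINTR) || (errno == EAGAIN)) &&
             (++attempts < max_attempts));

    return retval;
}
```

A caller that gets -1 back with errno still set to EAGAIN knows that the retries were exhausted and can close and re-open the camera connection instead of waiting indefinitely.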
--
KennethLavrsen - 12 Apr 2006
Fix record
I have been following this bug report for some time, but haven't had the time to actively pursue it (plus, I was really hoping that someone else would find the problem [that's a special abbreviation for 'my stupid mistake']). I think I now understand what went wrong, and how to fix it.
In version 3.2.4 some changes were made to netcam.c, including putting all the network access ('recv' calls) into a common routine, netcam_recv.c. The handling of the socket was also changed to use a receive timeout (SO_RCVTIMEO). The code for this was, most unfortunately, just plain wrong. There were two separate problems - first, the way the 'recv' call was done was wrong; second, SO_RCVTIMEO wouldn't really do what was desired. On the first part, because the code was wrong, it was never executed so there was never any problem. Then, a fix was made for the incorrect code, but the initial fix was also wrong (an unsigned size_t was used instead of the correct signed ssize_t), so still the code didn't get executed and there was no problem.
The change between the daily snap of 20051224 and that of 20051227 was that the unsigned variable was changed to the (correct) signed one. Now, at last, the code which was intended to handle a network timeout (finally) got executed - and now the second problem (SO_RCVTIMEO didn't do what was needed) came into play. This resulted in the (incorrect) behaviour noted in this bug report.
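The effect of the unsigned variable can be reproduced in isolation. This is a minimal sketch, assuming the error check had the form retval < 0 as in the loop condition quoted earlier in this topic; the function names are hypothetical and not taken from netcam.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Return value of a recv()-style call stored in an unsigned size_t:
 * -1 wraps around to SIZE_MAX, so the "< 0" test can never be true
 * and the error-handling path is dead code. Some compilers warn that
 * this comparison is always false. */
static bool failure_detected_unsigned(ssize_t recv_result)
{
    size_t retval = (size_t)recv_result;
    return retval < 0;
}

/* Same check with the correct signed ssize_t: -1 stays -1 and the
 * error-handling path is finally reached. */
static bool failure_detected_signed(ssize_t recv_result)
{
    ssize_t retval = recv_result;
    return retval < 0;
}
```

For a failed call (-1), the unsigned version reports no failure and the signed version reports one, which is why the timeout-handling code only started to run once the variable type was corrected.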
I have now made some changes to netcam.c which (I sincerely hope) will fix the problem. I have done away with the attempt to use SO_RCVTIMEO and, instead, added some code to use a 'select' statement to accomplish the desired timeout on a receive. The changed code has been committed to the svn repository, and will show up in the next daily snap. However, before declaring that the problem is fixed, I would greatly appreciate as much testing as possible (I have only done some limited tests which all came out okay).
Thanks to all who have provided the valuable information and observations so far.
--
BillBrack - 16 Apr 2006
The motion-20060416-091524 snap from MotionDailySourceSnap contains the fix. Thanks a lot, Bill, for working on this. Let us hope for good feedback.
--
KennethLavrsen - 16 Apr 2006
Thanks - I'll try this out right away.
Edit:
Everything is looking all right from here. I have tried both a camera reset and unplugging the power to the camera; Motion carries on and re-connects to the stream when the camera is available again.
Thanks a lot again, and have a happy Easter out there.
--
TheOtherBug - 16 Apr 2006
I was having this problem with a D-Link DCS-G900 camera + Debian Sarge system, it would drop every night. I've since downloaded the code from the SVN repository, built it as a .deb package on another system (testing release), and then installed it (plus some dependencies) on the production server. Everything seems to be working now, 3 days without failure.
--
ColinOsterhout - 31 May 2006
Fixed in 3.2.7
--
KennethLavrsen - 20 Oct 2006