Troubleshooting a PHD guiding issue

Bhwolf · February 16, 2015, 8:21pm

I lost a number of subs last night which is pretty unusual, so trying to figure out what happened with my guiding. Wind certainly had an impact, but I’m seeing timeouts/gaps in the guide logs. Using a QHY5II-L (ASCOM), latest dev build of PHD, beta 10 of SGPro.

This one sub in particular was 20min from 21:42:58 - 22:03:14. In the SGPro log, it looks like SGPro is querying PHD’s status many times per second (may be unrelated) but then there’s a timeout. The sub finishes but looks like guiding stopped for part of that time. Example with content snipped:

[2/15/2015 9:42:58 PM] [DEBUG] [Camera Thread] SGM_CAMERA_CAPTURE message received…
[2/15/2015 9:42:58 PM] [DEBUG] [Camera Thread] ASCOM (QSI) Camera: setting gain to HIGH…
[2/15/2015 9:42:58 PM] [DEBUG] [Camera Thread] ASCOM (QSI) camera: setting normal speed readout…
[2/15/2015 9:43:01 PM] [DEBUG] [Main Thread] PopulateDataModel: Transferring view to the data model…
[2/15/2015 9:43:01 PM] [DEBUG] [MF Update Thread] Performing serialize…
[2/15/2015 9:47:09 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Pre-Wait : Guiding
[2/15/2015 9:47:09 PM] [DEBUG] [PHD2 Listener Thread] Sending to PHD2:
{“method”: “get_app_state”, “id”: 1001}

[2/15/2015 9:47:09 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Post-Wait: Guiding
[2/15/2015 9:47:09 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Pre-Wait : Guiding
[2/15/2015 9:47:09 PM] [DEBUG] [PHD2 Listener Thread] Sending to PHD2:
{“method”: “get_app_state”, “id”: 1001}

<< SNIP MORE OF ABOVE >>

[2/15/2015 9:47:13 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Post-Wait: Guiding
[2/15/2015 9:47:13 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Pre-Wait : Guiding
[2/15/2015 9:47:13 PM] [DEBUG] [PHD2 Listener Thread] Sending to PHD2:
{“method”: “get_app_state”, “id”: 1001}

[2/15/2015 9:47:13 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Post-Wait: Guiding
[2/15/2015 9:47:13 PM] [DEBUG] [PHD2 Listener Thread] PHD2 Error - Connection Timed Out!
[2/15/2015 9:47:13 PM] [DEBUG] [PHD2 Listener Thread] PHD2 connection terminated…
[2/15/2015 10:02:58 PM] [DEBUG] [Sequence Thread] Waking from exposure time sleep period…
[2/15/2015 10:02:58 PM] [DEBUG] [Sequence Thread] Checking to see if the CCD has a temp…
[2/15/2015 10:02:58 PM] [DEBUG] [Sequence Thread] Saving the CCD temp…
[2/15/2015 10:02:58 PM] [DEBUG] [Sequence Thread] Entering super dangerous loop to await image completion…
[2/15/2015 10:03:29 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Post-Wait: BadConnection
[2/15/2015 10:03:29 PM] [DEBUG] [PHD2 Listener Thread] PHD2 GetPhdStatus - Pre-Wait : BadConnection
[2/15/2015 10:03:29 PM] [DEBUG] [PHD2 Listener Thread] Sending to PHD2:
{“method”: “get_app_state”, “id”: 1001}

Using PHD2 Log Viewer (thanks Andy!) something catastrophic did happen, and there are gaps in the timeline (camera timeouts?) so the timeline at the top of the chart is inaccurate, but my mouse is hovering over the start of the ‘problem area’ at 21:59:17:

Attaching logs, though I certainly don’t expect anyone to comb through them all. Just hoping someone might have some tips on isolating where the problem lies!

Brian

PHD2_DebugLog_2015-02-15_202042-c.zip (738.1 KB)
PHD2_GuideLog_2015-02-15_202042.zip (93.1 KB)
sg_logfile_20150215194217.zip (48.1 KB)

Andy · February 17, 2015, 12:08am

Brian,

SGP should not be slamming phd2 with get_app_state messages, but I’m not convinced that is all there is to the problem.

I looked through your logs and see this at the time of one of the guiding anomalies (and a similar pattern at the other anomalies):

21:49:00.060 00.127 10160 Handling exposure in thread
21:49:01.267 01.207 10160 Exposure complete
21:52:59.959 238.692 10160 worker thread done servicing request
21:52:59.969 00.010 9544 Processing an image

There is a nearly 4 minute gap between line 2 and line 3 logged by phd2’s camera worker thread. The only thing that happens in phd2 between logging those two messages is: 1. your selected median noise reduction is applied to the image from the camera; 2. image pixels are scanned to determine screen stretch, and 3. a message is queued to the main thread. None of these operations should be able to block for 4 minutes!

Even if the main thread is unresponsive because SGP is hammering it with get_app_state requests, the camera worker thread should not block like that. I put together a test program to slam phd2 with get_app_state messages and it caused the UI to get very laggy, but did not cause the type of gaps that you saw.

Is it possible that the whole machine went unresponsive for 4 minutes? Can you tell us what type of machine it is – old? new? how much memory?

Andy

Bhwolf · February 17, 2015, 1:17am

Thank you Andy! Above and beyond as usual!

It’s a Lenovo W510, Core i7 (one of the earlier gen) with 8GB RAM and SSD, Win 8.1. Kind of old but still fairly performant. It is possible it froze up but unlikely. I did use the time lapse feature, only set to 125ms.

One thing I noticed; I use remote desktop to log into the machine to check in on things. I may do this from my iPad, or other laptop. I usually attribute sluggishness to screen painting, wifi, etc., but it did seem unusually slow yesterday. I chalked it up to everyone in the house on Netflix. To be sure, though, I looked at all the times I logged in remotely, and didn’t see any correlation with guide problems. (.93 is my laptop, .130 is the ipad)

2/15/2015 11:52:00 PM (.93)
2/15/2015 10:53:28 PM (.130)
2/15/2015 10:21:12 PM (.130)
2/15/2015 9:17:10 PM (.130)
2/15/2015 8:48:06 PM (.130)
2/15/2015 8:39:45 PM (.93)
2/15/2015 8:28:07 PM (.93)
2/15/2015 7:50:20 PM (.93)
2/15/2015 7:42:05 PM (.93)

I’m going to comb through the windows event logs and see if anything jumps out… Will report back!

Ken · February 17, 2015, 1:42am

While I am not sure what caused this issue (starts with SGPro not receiving any PHD messages for more than 10 seconds), I have located the code responsible for hammering PHD2. The good news is that you only see this if PHD2 has entered an error state. This behavior has been reduced to query for state once every 10 seconds.

Andy · February 17, 2015, 1:49am

I agree, that pretty much rules out a Windows resource hang-up. I actually have that same laptop though I’m running Win7 on it, and that could conceivably make a difference,

The correlation we do see is SGP slamming phd2 with get_app_state messages, though we do not know if that is the cause.

Would you be willing to try my test program on your machine to slam phd2 with messages? If so please contact me at my email address andy.galasso@gmail.com and I’ll send it to you.

Andy

Ken · February 17, 2015, 1:53am

True, but the genesis event is still unknown. SGPro’s bad behavior only starts after it has not received a message for 10 seconds. This means that the last message SGPro received was at 21:46:59 and starting at 21:47:09, over the next 5 seconds, SGPro proceeds to pummel PHD2 for status.

Bhwolf · February 17, 2015, 3:09am

Andy – email sent – thanks!

I haven’t found anything obvious yet in the logs, but will continue to poke around the machine. Another possibility is that it was the coldest night so far this year (a few degrees below freezing here in NC), so USB connections sometimes seem more flakey. The QHY5L-II isn’t exactly the most stable cam, either.

While I do update to the latest dev builds of both PHD and SGPro when available, historically the combination has been rock solid for me.

Andy · February 17, 2015, 4:41am

Yes, camera communication is always the first thing to question in these types of situations. We would definitely have seen evidence in the log if it was a camera communication problem, but in all cases the log showed the delay after the image was successfully downloaded from the camera.

Andy