Devices go offline and online frequently

I’m using an iHost gateway with some Zigbee Sonoff and Tuya devices!!! All of them randomly go offline!!

IT IS A TERRIBLE EXPERIENCE!!!

Is anyone official reading this forum?!

Really terrible? I know many more terrible real-life situations. I’m leaving out the ass frostbite, of course. You are exaggerating.

Oops, it seems we have an overly sensitive, context-unaware word police here, guys. Better watch out.

This is terrible.

Of course, revolutionary vigilance must always be maintained :boom:

I have had the same problem for the last 3 weeks with all my Sonoff devices connected through eWeLink and others connected through Tuya/Smart Life. They all go offline and come back online after several minutes.
Sorry for my english.
Fabio from Italy.

Just following up on my post above. If the keep-alive timeout was happening at the server then it would happen for everyone and always. This is unlikely as not everyone complains. Therefore I think the timeout is happening elsewhere. The most likely place is in a NAT gateway in the ISP access network. Network Address Translation (NAT) is typically used to map between public addresses and private addresses. Port translation table entries are created each time a new TCP connection is made. If a table entry times out and is deleted then there is no way for a packet to be sent from the ewelink server to the Sonoff box. If the box doesn’t send keep-alive packets frequently enough then the NAT entries can time out.
To check if you have a NAT in your access network, open a CMD window and run a traceroute command such as “tracert www.google.com”. It will list the IP addresses of the routers along the route. If you see private addresses in the path beyond your local router, there is a NAT gateway in your ISP’s network. Private IP addresses are usually 10.x.x.x or 192.168.x.x (172.16.x.x to 172.31.x.x is also private, and 100.64.x.x to 100.127.x.x is reserved for carrier-grade NAT). Note that the first address in the list will be private, as your own LAN is private and there is a NAT in your own local router. I doubt that the local router NAT is timing out though - there is usually no shortage of entries in a local access router, so there is no incentive to time them out quickly.
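The traceroute check above can be sketched in Python. This is only an illustration: the function names and the sample hop list are made up, and it assumes you have already copied the hop addresses out of the tracert output by hand.

```python
import ipaddress

# RFC 1918 private ranges, plus the shared range used by carrier-grade NAT.
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")
]


def is_nat_side(addr: str) -> bool:
    """True if the address is in a range that normally sits behind a NAT."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_NETS)


def upstream_nat_hops(hops: list[str]) -> list[str]:
    """Return private hops after the first one (the first is your own router)."""
    return [h for h in hops[1:] if is_nat_side(h)]


# Hypothetical hop list copied from a tracert run.
hops = ["192.168.1.1", "10.20.0.1", "203.0.113.5"]
print(upstream_nat_hops(hops))  # ['10.20.0.1']
```

A non-empty result suggests there is a second NAT somewhere in the ISP network; an empty result does not prove there isn’t one, since some routers do not answer traceroute at all.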
Does anyone have a trace of the packets between a Sonoff or MHCOZY device and the ewelink server when the device is idle? I’d like to see the frequency of the keep-alives.

Good morning, I’m trying with a 4G modem from the same company as my FTTC connection. There appears to be no disconnect. This explains some things but casts doubt on others.
My problem now is being able to route the sonoff devices to the 4g gateway and leave the rest on the main FTTC connection.
The routing rules require knowing the destination address in order to direct traffic to a gateway, but I have not found any information about the public IP addresses of the online eWeLink servers.
I’m going crazy trying to figure out whether there is a device that can route by local source IP to a local gateway. Let me explain better: I need traffic from certain IP addresses (the Sonoff devices) to be directed to the 192.168.1.2 (4G) gateway, while the rest of the devices continue to use the main 192.168.1.1 connection.
If anyone could help me I would be very grateful!

It would probably be possible, but only with OpenWRT could you create such combinations on one router… But why do it? I replaced my setup (Netgear R6350 + zte mf823 modem) with a zte mf286d. The problem with devices temporarily changing status to offline has disappeared. By the way, to quickly diagnose the problem, you can try the hotspot (router) option on an Android phone. If, after turning off the router and setting the same SSID on the phone, the Sonoff devices communicate more stably, the problem is in the router.

I think your digression is flawed. In my opinion, such forums serve the creative exchange of information, thanks to which we can solve our problems ourselves, while improving ourselves and keeping our household budget in check… :slight_smile:

But should we be the ones looking for the source of the problem in such a situation? The Itead cloud server administrators should notice it, because it generates a huge and unnecessary extra load from device reconnections, multiplied by the hundreds of thousands of users per hour currently having this problem…

@0o0o0o0 I’m investigating this further at the moment. Tracing what’s going on isn’t easy - I now have a Wi-Fi repeater and Access point with an Ethernet segment in the middle so I can monitor what’s on the Ethernet using a switch with a mirror port and Wireshark on a PC.
So, I can now see that the box sends a message to the server about every 150 seconds. I have yet to capture it going offline, but that’s not so easy. If the NAT timeout theory is correct and everything is idle, then it won’t appear offline at all - the next keep-alive transmitted from the box will re-create the NAT entry and off it’ll go again without the timeout being evident. Only if the server tries to contact the box while the NAT entry is timed out will it fail to reach the box. I’ve only had my switch with mirroring for a couple of days, so I’ve yet to come to any conclusions. I’ll post again if I can confirm the NAT theory or otherwise.
One thing I didn’t realise until now - if your phone is on the same LAN as the box then the phone talks directly to the box to switch a port. The box then updates its state to the server. i.e. the direction of the traffic is from the box to the server which will refresh a timed-out NAT entry. So, to do my test to check if it’s timed out I’ve to switch off the Wi-Fi on my phone so the phone uses 4G to talk to the server. Now when I make a switch it is the server that updates the box. If you try this you will see that if you go via 4G the switching time is slightly slower than if you’re on the LAN.
By the way - the ewelink server I’m connected to is hosted in AWS in Frankfurt. At least they’re using a good quality hosting service.

@0o0o0o0 after further investigation, I am fairly sure it is a NAT timeout issue.
To see if you’re having the same issue:

  1. Connect your phone to the cellular network, i.e. not to the same LAN as the box. The easiest way is to disable the Wi-Fi on your phone.
  2. Switch a port on or off. This will send an update from the server and a response from the box to the server.
  3. Wait about 2 minutes 10 seconds.
  4. Try to switch the port status again. The box will show offline and come back in about 20 seconds.

So, I believe my NAT entry is timing out after about 2 minutes. Once it has timed out there is no path for the server to contact the box. The NAT entry only gets refreshed when the box sends its next update to the server, 2 minutes 30 seconds (150 seconds) after the previous one.
This stacks up for me - the box sends a keepalive every 150 seconds but I think my NAT times out after 120 seconds. That leaves no NAT entry for the last 30 seconds of every 150-second cycle, so if the server sends a message to the box there is a 20% chance (30/150) that it will fail. I think that is about right - I switch a light on and off at sunrise/sunset and I get about a 20% failure rate.
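The 20% figure can be checked with a few lines of Python. This is a rough model, not a measurement: it assumes the NAT entry is refreshed only by the box’s own keepalive, expires a fixed time later, and that server messages arrive at a random moment in the cycle. The function name is made up for illustration.

```python
def miss_probability(keepalive_s: float, nat_timeout_s: float) -> float:
    """Fraction of each keepalive cycle during which no NAT entry exists."""
    if nat_timeout_s >= keepalive_s:
        # The entry is always refreshed before it can expire.
        return 0.0
    return (keepalive_s - nat_timeout_s) / keepalive_s


print(miss_probability(150, 120))  # 0.2
print(miss_probability(60, 120))   # 0.0
```

The second call shows why a shorter keepalive interval would help: once the interval is below the NAT timeout, the gap disappears in this model.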
The box sends a full status update over TLS/TCP at 150-second intervals, and I am sure eWeLink would be reluctant to increase the frequency too much as it would add a significant processing overhead. However, if they sent a simple TCP-level keepalive at, say, 60-second intervals it would keep the NAT entries alive without a lot of extra processing overhead.
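For anyone wondering what a TCP-level keepalive looks like from the client side, here is a minimal Python sketch. It is not Sonoff or eWeLink code, and the 60-second values are just the figure suggested above. The tuning option names are platform specific (these are the Linux ones), so the sketch only sets the ones the local platform exposes.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=60, interval_s=60, probes=3):
    """Ask the OS to send TCP keepalive probes on an otherwise idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning options are not available on every platform,
    # so only set the ones this Python build exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # time between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
sock.close()
```

The probes are handled by the operating system’s TCP stack and carry no application data, which is why they are cheap compared with a full status update over TLS.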

Same issue here, thought I was going mad. My Sonoff Micros have been going offline randomly since around the beginning of the year. No issues for years up till now, and now they both go off randomly, but I have no issues with anything else on the network.

NAT?! What on earth has NAT got to do with this? You’re making up theories as if you were from the Ancient Aliens series.

Boys and girls, there’s no one thing that is going to resolve everyone’s issues here. The cause may differ from one person to another, yet they all manifest as the same “offline issue”. Bottom line: no one thing is going to fix it for everyone. There are likely many, or at least several, different causes for this pain.

Yes, it would be very useful for everyone to continue to post good ideas and things they tried.

I have 50+ Sonoff devices. And more than 50 non-Sonoff devices, actually about 70. No point in going into the things that I tried and that didn’t help. What solved nearly all my offline problems, both Sonoff and non-Sonoff, was replacing my routers. And it’s not always simply a matter of a newer router. We tend to forget that we keep adding more IoT devices without keeping count. Many routers, even the newest ones (Eero and Wyze, to name two), have limits on the number of devices. These two brands both allow up to 75 devices per node. HOWEVER, when you reach the 75-device limit, the 76th, 77th, etc. do not simply roll over to another node. They DO NOT always reconnect elsewhere. AND there’s no feature that allows you to assign a device to a closer node. It’s easy to overlook that you have exceeded what your router can reliably keep connected, and that you have no way to assign a device to another router node.

I went through half a dozen routers before I stumbled on a solution for most of my offline issues. I found a TP-Link Deco AXE5400 mesh set. It has two features that solved nearly all my issues. 1) The limit for the 5400 is 200 devices. That fixed the device limit. 2) In the system settings there’s a feature that allows me to select devices and assign their connection to a specific node. NEVER have I seen this elsewhere. And it worked. And it stays assigned.

I was using some of the Sonoff SNZB-01 and SNZB-0x subdevices. They would never stay connected to the early version of the Sonoff Zigbee hub/bridge. I gave up on them. I finally got the Sonoff ZB Pro hub AND stopped using the SNZB-0x subdevices. BUT when the SNZB-01P and similar versions came out, I hopped on them. THEY WORK GREAT. I love the SNZB-01P wireless switches. I have them everywhere, use them mostly via Alexa routines, but can also use them with eWelink Scenes.

I encourage you to take a good look at your device limit on your router. And you can use a free tool called Fing to do a device count. Play around with it, don’t depend on one scan to find everything. Doesn’t tell you which node, but tells you your count.

Best of luck. Just sharing my experience.

Get yourself a Home Assistant server. This is the best to share with anyone :sunglasses:

I’ve tried that. I don’t like that HA “takes” the device away from the app where I have better luck managing the device. Not worth it to me. And not everything I use is supported in HA. So, I would wind up with some devices in HA and half not. Again, another no-go for me.

What are you talking about?!

Very interesting analysis. I also tried to trace the network traffic on the router and the mf823 modem (it has the ability to connect via telnet and run tcpdump) and correlate this data with kernel timers, but it was beyond me. I even saved the exchange of tcp/ip messages when the disconnection problem occurred in order to send this data to Ewelink support, but I gave up on it, discouraged by the previous lack of response when I reported another problem with the RF bridge. Now I use the mf286d router, which does not have the problem.

@0o0o0o0 @jam3
Guys, just an update on this. Using a PC and an Ethernet hub (not a switch!), I managed to hack together a piece of Python code that injects TCP duplicate ACKs into the stream, and it fixed the problem. I wasn’t able to use a TCP keepalive as it seemed to upset the Sonoff device. So, the TCP duplicate ACK is keeping a NAT or other table alive for the full 150 seconds between the Sonoff status update messages.
However, I am not yet sure where this table entry is timing out. I have a Kuwfi wi-fi repeater in the circuit and it maintains a NAT-like table. All packets coming from the repeater segment towards the central LAN have the same source MAC address, so the Sonoff MAC address is hidden inside the repeater. Packets heading back towards the Sonoff get the destination MAC replaced by looking up this table. I suspect it might be this table that is timing out. If so, it might be a lot simpler to fix by having something on the main LAN pinging the Sonoff on the remote LAN. I have yet to try this.
There is also the possibility that the NAT entry in my main router is timing out - I doubt it though.
It could be NAT in the ISP - the first hop towards the Internet is a private address so there is a NAT gateway in there too.
I’ll keep you updated - my prime suspect is the Kuwfi repeater.
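The untried idea of having something on the main LAN ping the Sonoff could look roughly like this in Python. It is a sketch only: the function names, the 30-second interval and the address 192.168.1.50 are placeholders, and whether a ping actually refreshes the repeater’s table is still unconfirmed.

```python
import platform
import subprocess
import time


def build_ping_command(host, system=None):
    """Build a one-shot ping command for the local OS."""
    system = system or platform.system()
    # Windows uses -n for the packet count, most other systems use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", host]


def keep_alive(host, interval_s=30.0, rounds=None):
    """Ping the host every interval_s seconds; rounds=None means forever."""
    sent = 0
    while rounds is None or sent < rounds:
        subprocess.run(
            build_ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        sent += 1
        if rounds is None or sent < rounds:
            time.sleep(interval_s)
    return sent


print(build_ping_command("192.168.1.50", "Linux"))
# ['ping', '-c', '1', '192.168.1.50']
```

Calling `keep_alive("192.168.1.50")` on a PC on the main LAN would run until interrupted. The interval only needs to be shorter than whatever table timeout is suspected.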