Sites freezing in middle of update process

mathewcallaghan · September 8, 2019, 11:11pm

I agree with @ozfiddler: an upgrade process that fails is not good. This is one of the reasons I like WP over Drupal, for example. It is a very solid foundation.

spanner44 · September 8, 2019, 11:12pm

I had a similar problem when I was trying to do the automatic update on my site: it got stuck at the unpacking stage. I waited a while but nothing happened, so I right-clicked on the Dashboard tab and opened it in a new window, which reported at the top that an update was available.

When I then tried again, presuming there had been an error (as far as I know the site never went into maintenance mode), it said I had to wait 15 minutes as it was already updating or had tried to (can’t remember the exact wording).

So I waited and tried again and it updated without any problems, so I just presumed the fault was with the server I was on. I thought perhaps it was doing maintenance, so my site temporarily lost connection.

Anyway, the update completed OK, so I didn’t think anything of it.
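
The 15-minute wait described above is consistent with a time-limited update lock: an update that starts records a timestamp, and further attempts are refused until that timestamp is old enough to be considered abandoned. The following is only a minimal Python sketch of that idea, not the actual ClassicPress code; the names `try_acquire` and `release`, and the dict used as lock storage, are hypothetical.

```python
import time

LOCK_TIMEOUT = 15 * 60  # seconds; assumed to match the 15-minute wait reported above


def try_acquire(lock, now=None, timeout=LOCK_TIMEOUT):
    """Take the lock if it is free or has expired; return True on success.

    `lock` is a plain dict standing in for wherever the timestamp is stored.
    """
    now = time.time() if now is None else now
    started = lock.get("started")
    if started is not None and now - started < timeout:
        return False  # another update is (or recently was) in progress
    lock["started"] = now
    return True


def release(lock):
    """Clear the lock after an update finishes normally."""
    lock.pop("started", None)
```

If an update dies without calling `release`, the stale timestamp blocks new attempts only until the timeout has passed, which would match the "wait and try again" behaviour described in this post.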

timkaye · September 8, 2019, 11:30pm

Which is why I wrote “might”.

ozfiddler · September 8, 2019, 11:31pm

Did you get any errors in the logs?

I’ve just tried the troublesome site a few more times and I am now getting this:

[09-Sep-2019 08:45:02 Australia/Sydney] PHP Fatal error: require(): Failed opening required '/home/avmaorga/public_html/wp-includes/load.php' (include_path='.:/opt/alt/php72/usr/share/pear') in /home/avmaorga/public_html/wp-settings.php on line 19
[09-Sep-2019 08:46:03 Australia/Sydney] PHP Warning: require(/home/avmaorga/public_html/wp-includes/load.php): failed to open stream: No such file or directory in /home/avmaorga/public_html/wp-settings.php on line 19
[09-Sep-2019 08:46:03 Australia/Sydney] PHP Warning: require(/home/avmaorga/public_html/wp-includes/load.php): failed to open stream: No such file or directory in /home/avmaorga/public_html/wp-settings.php on line 19
[09-Sep-2019 08:46:03 Australia/Sydney] PHP Fatal error: require(): Failed opening required '/home/avmaorga/public_html/wp-includes/load.php' (include_path='.:/opt/alt/php72/usr/share/pear') in /home/avmaorga/public_html/wp-settings.php on line 19
[09-Sep-2019 08:47:02 Australia/Sydney] PHP Warning: require(/home/avmaorga/public_html/wp-includes/load.php): failed to open stream: No such file or directory in /home/avmaorga/public_html/wp-settings.php on line 19
[09-Sep-2019 08:47:02 Australia/Sydney] PHP Warning: require(/home/avmaorga/public_html/wp-includes/load.php): failed to open stream: No such file or directory in /home/avmaorga/public_html/wp-settings.php on line 19
[09-Sep-2019 08:47:02 Australia/Sydney] PHP Fatal error: require(): Failed opening required '/home/avmaorga/public_html/wp-includes/load.php' (include_path='.:/opt/alt/php72/usr/share/pear') in /home/avmaorga/public_html/wp-settings.php on line 19

ozfiddler · September 9, 2019, 12:13am

…and of course when the support team at Synergy tried it, it went through perfectly.

I give up.

james · September 9, 2019, 8:51pm

I have been thinking some more about this thread. First of all this is definitely not the experience we want people to have with ClassicPress, and of course I agree with this:

From the errors you posted, it looks like sometimes the update is aborting partway through the process. This may leave files in an inconsistent/broken state.

It is pretty difficult to identify an exact cause for this issue, because it doesn’t happen consistently even on the same server, as you’ve seen. I have also never seen this issue myself. However here are a few factors that are possibly relevant:

The upgrade process will attempt to set an execution time limit of 5 minutes, which should be much more than enough for an upgrade to complete under any circumstances, but hosts can choose not to honor this request which makes a failure more likely.
Exceeding the memory limit during an upgrade request is also a possibility.
Servers with slower download speeds (to download the upgrade package) or disk access (to unpack it) would take longer to process an update and therefore be more likely to experience a timeout.
Shared hosting servers with many clients will experience higher load, taking a bit longer to do just about everything, and therefore failures are more likely in these environments too.
The first update attempted on a given site would be more likely to fail, since none of the files being upgraded would be present in the server’s filesystem cache.
WordPress does partial upgrades where possible, only overwriting the files that need to be updated. We have not implemented this yet. (This would help with upgrades in between ClassicPress versions, but not so much with migration from WordPress as almost every file needs to be changed.)
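
To illustrate the last point, a partial upgrade can be thought of as comparing file hashes between the installed tree and the new release, so that only changed or missing files get rewritten. This is a minimal Python sketch under that assumption, not how WordPress or ClassicPress actually implements it; the function names `hash_tree` and `files_to_update` are hypothetical.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    hashes = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            hashes[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def files_to_update(installed, release):
    """Return the release files that are missing or different in the install."""
    return sorted(
        rel for rel, digest in release.items()
        if installed.get(rel) != digest
    )
```

Calling `files_to_update(hash_tree(install_dir), hash_tree(release_dir))` would give the list of files to write. Between two close versions that list should be short, which means less disk work and less time in which the process can be killed; for a migration from WordPress it would contain almost every file, as noted above.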

The upgrade process being killed before it completes is definitely what is producing the above errors, but this theory does not explain everything. For example, I would expect any execution time or memory limit errors to appear in the error log also.

Still, the hosting environment is a huge factor in this type of issue, and one or more of the above factors coming into play would be enough to cause this. If you want to prove this for yourself then you can try the upgrade on a private server backed by solid-state storage and good network bandwidth (for example, Digital Ocean) and it will complete in 5-10 seconds every time. However this is not a feasible option for every site, and part of the appeal and success of WordPress and ClassicPress is that they can run almost anywhere.

So here is what I think we should do in order to improve the robustness of the upgrade process:

Launch the upgrade as a background process using an AJAX action. The initial action should just start the upgrade without waiting for a response, and it should write to a “journal” file which indicates its progress.
The upgrade page should poll for the current status of the upgrade using this “journal” file on the backend, and show progress messages via AJAX instead of during the course of a “normal” page load as it does today.
The upgrade page needs to be able to detect if the upgrade has “stuck” and re-initiate it. The upgrade process in the backend needs to know how to resume from a partial update.
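
The three steps above can be sketched roughly as follows. This is an illustrative Python sketch only, not a proposed implementation: the journal format (a JSON file with `done`, `updated_at` and `finished` keys), the 60-second stuck threshold and all function names are assumptions made for the example.

```python
import json
import os
import shutil
import time
from pathlib import Path


def load_journal(journal_path):
    """Read the journal, or start a fresh one if none exists yet."""
    path = Path(journal_path)
    if path.exists():
        return json.loads(path.read_text())
    return {"done": [], "updated_at": time.time(), "finished": False}


def save_journal(journal_path, journal):
    """Write the journal via a temp file so a kill never leaves it half-written."""
    journal["updated_at"] = time.time()
    tmp = str(journal_path) + ".tmp"
    with open(tmp, "w") as handle:
        json.dump(journal, handle)
    os.replace(tmp, journal_path)  # atomic rename on the same filesystem


def run_upgrade(source_dir, target_dir, journal_path):
    """Copy every file from source_dir to target_dir, skipping finished ones."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    journal = load_journal(journal_path)
    done = set(journal["done"])
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir).as_posix()
        if rel in done:
            continue  # already copied by an earlier, interrupted run
        dest = target_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dest)
        # Only recorded after the copy returns, so a file that was being
        # written when the process died is copied again on resume.
        journal["done"].append(rel)
        save_journal(journal_path, journal)
    journal["finished"] = True
    save_journal(journal_path, journal)
    return journal


def is_stuck(journal, now, timeout=60):
    """Polling side: unfinished and no progress for `timeout` seconds."""
    return not journal["finished"] and now - journal["updated_at"] > timeout
```

In this sketch the polling request would call `load_journal` and `is_stuck`, and re-run `run_upgrade` when the journal has gone stale; because finished files are listed in the journal, the second run picks up where the first one stopped.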

This should fix any and all issues with upgrades aborting partway through and also output buffering making the upgrade appear to “freeze”. (If output compression is enabled at the server level I think this will also effectively cause output to be buffered until the page load finishes, which doesn’t play nice with the way the upgrade process currently works.)

A couple of open questions here:

When JavaScript is disabled, what should happen? Should the upgrade proceed as it does today, or should we block it altogether?
The above plan makes sense for user-initiated upgrades, but what about truly automatic upgrades via wp-cron? I have not seen any reports of “my site broke itself” so maybe this code path doesn’t actually need any changes?
This is a big chunk of work, so how do we schedule it? Should we delay v2 in order to improve the robustness of the upgrade process?

Finally, another thing that would be helpful is a way to see this issue myself. @ozfiddler, I will reach out to you separately about this.

ozfiddler · September 9, 2019, 11:05pm

Thanks for looking into this James.

Yes. I do think this is happening. On one of the sites that was causing problems (but I thought eventually worked) I have just had a Shield warning about a file that was not recognised. When I investigated I found it was the edit-comments.php file, which ended abruptly at line 292 (there should be 334 lines).
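
One way to look for other files in this state is to compare the install against a freshly unpacked copy of the same release. The Python sketch below is only an illustration under that assumption (it needs a known-good reference copy on disk); the function name `check_install` and the three status labels are made up for the example.

```python
from pathlib import Path


def check_install(install_dir, reference_dir):
    """Compare an install against an unpacked reference copy of the same release.

    Returns a dict mapping relative paths to "missing", "truncated" or "modified".
    """
    install_dir, reference_dir = Path(install_dir), Path(reference_dir)
    problems = {}
    for ref in sorted(p for p in reference_dir.rglob("*") if p.is_file()):
        rel = ref.relative_to(reference_dir).as_posix()
        installed = install_dir / rel
        if not installed.is_file():
            problems[rel] = "missing"
            continue
        expected = ref.read_bytes()
        actual = installed.read_bytes()
        if actual == expected:
            continue
        # A file cut off mid-write is a strict prefix of the reference file.
        if len(actual) < len(expected) and expected.startswith(actual):
            problems[rel] = "truncated"
        else:
            problems[rel] = "modified"
    return problems
```

A file like the edit-comments.php described above, which stops partway through, would be reported as "truncated", and the load.php from the earlier log would be reported as "missing".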

I’m not sure about it being a hosting environment issue. I have sites on “budget” hosting and “good” hosting, and I have had both problems and trouble-free upgrades on each.

ozfiddler · September 10, 2019, 12:09am

And I suspect this is related. I’m finding this sort of thing in my uploads folders:

invisnet · September 10, 2019, 1:00am

Let’s try to see what’s common between the sites that fail - plugins, size of uploads directory, that kind of thing.

I have an idea of what might be causing the problem, but it’s a bit esoteric so it makes sense to rule out the obvious stuff first.

ozfiddler · September 10, 2019, 1:08am

Yes, I’ve been thinking along those lines too but I can’t come up with anything much. The only common factor is that they all run Shield plugin. But I seem to recall I tried deactivating it on one site when there was a problem and it didn’t help.

I use very minimal plugins. On most sites it is my own utility plugin and Shield. Of course my own plugin was the first thing I suspected but one of the problem sites doesn’t even have it installed.

invisnet · September 10, 2019, 1:10am

OK, so maybe not plugins - what about themes? They can do everything a plugin can do.

I’m also quite curious about the amount of space the sites take on disk.

ozfiddler · September 10, 2019, 1:37am

One is running a custom built theme that was made by someone else years ago. The others use my own theme that is basically GeneratePress.

Sites are at most 150MB. On the good hosting there is 1GB available, on the cheaper hosting they have 500MB.

Good hosting has 1GB RAM, budget is 500MB.

invisnet · September 10, 2019, 2:02am

That’s a very big chunk of work. I’d suggest the time would be better spent on doing partial upgrades and making sure we get the signing right. We could spend a huge amount of time chasing after something that’s almost certainly a hosting issue, only to discover it’s something our changes don’t fix.

We’ve not changed the install/upgrade process more than strictly necessary for good reason - it may not be pretty, but it’s been debugged by brute force. I’d caution against changing anything unless we’re really, really sure we know what we’re doing and why.

mathewcallaghan · September 10, 2019, 2:19am

I did some testing and I cannot reproduce the issue when upgrading a clean install from 1.0.1 to 1.0.2. This is with server resources set to the minimum CPU, memory and disk space.

james · September 10, 2019, 2:25am

Yes, it is.

I agree we should do this; the only issue for me is that this will not fix the same problem with migrations.

james · September 10, 2019, 2:27am

I think there is something about the “good” hosting that is not quite as good as expected, possibly one or more of the things I mentioned above, for example disk access speed combined with a hard execution limit enforced by the host.

Anyway the best thing we can do about this right now is collect information. If this is happening on any hosting providers other than Zuver and Synergy then please do let us know.

This is a good clue! As Tim said above, LiteSpeed may not be the cause of the issue itself (but still might be because it’s an intermittent issue). Or, if these providers do in fact share the same server configuration, it could be something about that.

Edit to add: yes, I was able to confirm that Zuver and Synergy are both part of VentraIP (ref1, ref2, ref3). To me this points strongly towards a server configuration issue that is common to these providers.

invisnet · September 10, 2019, 2:49am

True. However, we’ve either inherited a bug or misfeature from WP, or (more likely) it’s a hosting issue we’ll have to go to great lengths to work around; without being able to reproduce it on demand the odds of fixing it except by accident are pretty remote.

It’s not that I think we shouldn’t try to fix it, just that we should be completely sure what it is we’re trying to fix before diving into rewriting the whole install/update system.

ozfiddler · September 10, 2019, 2:59am

Yes, I still think that is the most likely. They run a very tight ship and I have been up against restrictions in the past caused by some of their over-zealous lockdown settings. Still, the settings should be the same across all servers, so it is odd that only a few sites are affected.

I was assuming that anyway, until @spanner44 also reported exactly the same issue. He is in France so I doubt he’s a Synergy customer.

Agree with this. I think we should just monitor the situation at this stage.

james · September 10, 2019, 3:05am

Not the same issue. What he described would be a failed download from GitHub, which can happen any time the server has a network issue or just an intermittent failed connection; the upgrade should then succeed once that is resolved. The upgrade process is designed to cope with this more common type of failure. Instead, it seems like you are experiencing failures basically at random throughout all parts of the upgrade process, and some of these are pretty bad, like when a file is only partially written.

This kind of problem is sort of like “my car won’t start” for a mechanic: there are many different possible causes that all show similar symptoms.

Simone · September 10, 2019, 6:04am

I had the same trouble about two years ago with WP, but I wasn’t able to understand what was wrong.

Edit: on Aruba VPS, Italy.