Sites freezing in middle of update process

I have been thinking some more about this thread. First of all, this is definitely not the experience we want people to have with ClassicPress, and of course I agree with this:

From the errors you posted, it looks like sometimes the update is aborting partway through the process. This may leave files in an inconsistent/broken state.

It is pretty difficult to identify an exact cause for this issue because, as you’ve seen, it doesn’t happen consistently even on the same server. I have also never seen this issue myself. However, here are a few factors that are possibly relevant:

  • The upgrade process will attempt to set an execution time limit of 5 minutes, which should be much more than enough for an upgrade to complete under any circumstances. However, hosts can choose not to honor this request, which makes a failure more likely.
  • Exceeding the memory limit during an upgrade request is also a possibility.
  • Servers with slower download speeds (to download the upgrade package) or disk access (to unpack it) would take longer to process an update and therefore be more likely to experience a timeout.
  • Shared hosting servers with many clients will experience higher load, taking a bit longer to do just about everything, and therefore failures are more likely in these environments too.
  • The first update attempted on a given site would be more likely to fail, since none of the files being upgraded would be present in the server’s filesystem cache.
  • WordPress does partial upgrades where possible, only overwriting the files that need to be updated. We have not implemented this yet. (This would help with upgrades in between ClassicPress versions, but not so much with migration from WordPress as almost every file needs to be changed.)
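
To illustrate the partial-upgrade idea in the last point, here is a minimal, language-agnostic sketch written in Python rather than PHP. It assumes a hypothetical manifest (`new_checksums`) that maps each relative path in the new release to an MD5 digest; the function names are made up for illustration and are not part of WordPress or ClassicPress.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def files_to_update(install_dir: Path, new_checksums: dict[str, str]) -> list[str]:
    """List relative paths that are missing or differ from the new release.

    `new_checksums` is a hypothetical manifest: relative path -> expected MD5.
    """
    changed = []
    for rel_path, expected in sorted(new_checksums.items()):
        target = install_dir / rel_path
        # Overwrite only files that are absent or whose contents differ.
        if not target.is_file() or file_md5(target) != expected:
            changed.append(rel_path)
    return changed
```

Only the paths returned by `files_to_update` would need to be written, which shortens the window in which a timeout can leave the site half-upgraded.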

The upgrade process being killed before it completes is definitely what is producing the above errors, but this theory does not explain everything. For example, I would also expect any execution time or memory limit errors to appear in the error log.

Still, the hosting environment is a huge factor in this type of issue, and one or more of the above factors coming into play would be enough to cause this. If you want to prove this for yourself, you can try the upgrade on a private server backed by solid-state storage and good network bandwidth (for example, DigitalOcean), and it will complete in 5-10 seconds every time. However, this is not a feasible option for every site, and part of the appeal and success of WordPress and ClassicPress is that they can run almost anywhere.

So here is what I think we should do in order to improve the robustness of the upgrade process:

  • Launch the upgrade as a background process using an AJAX action. The initial action should just start the upgrade without waiting for a response, and it should write to a “journal” file which indicates its progress.
  • The upgrade page should poll for the current status of the upgrade using this “journal” file on the backend, and show progress messages via AJAX instead of during the course of a “normal” page load as it does today.
  • The upgrade page needs to be able to detect if the upgrade has “stuck” and re-initiate it. The upgrade process in the backend needs to know how to resume from a partial update.
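
The three points above can be sketched roughly as follows. This is a Python illustration of the journal idea only, not the actual upgrader: the step names, the JSON journal format and the 60-second stall threshold are all assumptions made up for the example.

```python
import json
import time
from pathlib import Path

# Hypothetical step names; the real upgrade has its own stages.
STEPS = ["download", "unpack", "copy_files", "cleanup"]


def read_journal(journal: Path) -> dict:
    """Return the journal contents, or a fresh record if none exists yet."""
    if journal.exists():
        return json.loads(journal.read_text())
    return {"completed": [], "updated_at": 0.0}


def write_journal(journal: Path, state: dict) -> None:
    """Write via a temp file and rename so a poller never reads a partial file."""
    state["updated_at"] = time.time()
    tmp = journal.with_name(journal.name + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(journal)


def run_upgrade(journal: Path, handlers: dict) -> None:
    """Run every step not yet journaled as completed, recording each one."""
    state = read_journal(journal)
    write_journal(journal, state)  # mark the upgrade as started / still alive
    for step in STEPS:
        if step in state["completed"]:
            continue  # finished by an earlier, interrupted run
        handlers[step]()
        state["completed"].append(step)
        write_journal(journal, state)


def status(journal: Path, stall_seconds: float = 60.0) -> str:
    """What a polling endpoint might report back to the upgrade page."""
    if not journal.exists():
        return "not_started"
    state = read_journal(journal)
    if len(state["completed"]) == len(STEPS):
        return "done"
    if time.time() - state["updated_at"] > stall_seconds:
        return "stuck"  # no progress recently; the page could re-initiate
    return "running"
```

When the page sees `stuck`, it would call `run_upgrade` again and the completed steps are skipped. The interrupted step is re-run from its beginning, so in this sketch each step has to be safe to repeat.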

This should fix any and all issues with upgrades aborting partway through, as well as output buffering making the upgrade appear to “freeze”. (If output compression is enabled at the server level, I think this will also effectively cause output to be buffered until the page load finishes, which doesn’t play nice with the way the upgrade process currently works.)

A few open questions here:

  • When JavaScript is disabled, what should happen? Should the upgrade proceed as it does today, or should we block it altogether?
  • The above plan makes sense for user-initiated upgrades, but what about truly automatic upgrades via wp-cron? I have not seen any reports of “my site broke itself” so maybe this code path doesn’t actually need any changes?
  • This is a big chunk of work, so how do we schedule it? Should we delay v2 in order to improve the robustness of the upgrade process?

Finally, another thing that would be helpful is a way to see this issue myself. @ozfiddler, I will reach out to you separately about this.
