How to deal with weird things in translations

anon66243189 · September 17, 2021, 8:19am

In Crowdin there are sometimes strange things in the translation

Example

Original text:
this is some text

Translation:
this is „some„ text

I am aware that „ stands for quotes, but, since
A) those quotes are not in the original text
B) they should likely be real quotes ""
…how to deal with this?
Remove?
Change to ""?
Keep using „ even if it is not in the original text, and even if it is not a proper quote?

joyously · September 17, 2021, 3:12pm

I don’t think the entity is the right approach, especially if it was added by the tool.
To get the correct quotes for the language, you can use <q>some</q> but it’s best to keep markup out of translations, and a lot of theme authors bork the quotes with CSS (not knowing what they are doing).
There could be problems with having quotes in a translation, not just entering them in the tool but storing in the tool (or is it a flat file?) and then when it’s displayed hopefully it’s escaped if used in an attribute.

Basically, I don’t know and I would ask someone knowledgeable over at WP. Perhaps they have a doc?

anon66243189 · September 17, 2021, 4:44pm

It’s crowdin “tool” so I can not really ask in wp

It’s an error or speciality of that tool we use… let’s hope someone has some insight into this or we’ll have to try to see “what happens when”

ElisabettaCarrara · September 17, 2021, 9:44pm

That is not an error. It’s a behaviour.
Crowdin is not a “translator” like Google Translate.
It’s a CAT tool (computer assisted translation).
It makes use of data from various sources to put together the best suggested translation to be checked and revised.
What you have seen is probably a formatting tag.
(CAT tools use tags to render certain formatting options, links and other things).
Usually tags come in “pairs” (opening and closing with text in the middle) but in certain cases they come as a single tag included in a line of text.
Another thing that may be happening (and also happens in translate WP sites) is a different character encoding. That case should be dealt with as in the translate WP sites.
To say which is the case I need to see the culprit line to be sure of what the tool is doing there. Generally tags are highlighted, but sometimes can be broken and shown as plain text with no highlight.

anon66243189 · September 18, 2021, 1:20am

I posted the precise text and code that is shown above in the message

It’s clear that it is a “” apostrophe
But it is missing in the source text

So to me that’s not clear what should happen or what it is or why it even is there
I’ve worked with CAT before and they generally don’t “add things where there aren’t any things”

For examples, you can query any language in crowdin and filter by “space issues”
All issues mentioned that say there is a “;” after the space (or none) are affected by this
It’s at least a couple hundred strings

The main question is… leave the encoded part, or use “” or use nothing instead

james · September 18, 2021, 2:02am

Do you have a screenshot of this string with its proposed translation? If you go into Proofreading mode then you get a nice display that also shows the place where it appears in the core code, which can sometimes be illuminating for unexpected translations.

Unless there is some reason for the quotes being different between the source and the translation, then I agree it’s probably best to remove the quotes from the translation. The most likely explanation is a translation error that we inherited from WP, and this is exactly the kind of thing that the process described at Approving and cleaning existing strings on Crowdin is designed to find and fix.

Otherwise the fact that HTML entities appear in translations is often normal and correct. Here’s an example of this being done correctly in Crowdin:

Finally if translating into a language that uses a different style of quotes then that should be reflected directly in the translated string - no need to use a hack like <q> for that.
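To see which character an entity actually stands for, it can help to decode it outside the tool. The following is only a minimal Python sketch using the standard `html` module; the entity names in `samples` are assumptions for illustration (the forum rendered the entity from the Crowdin string, so its exact spelling is not visible in this thread).

```python
import html

# Assumed sample entities; the exact entity used in the Crowdin strings
# is not visible in this thread.
samples = ["&bdquo;", "&ldquo;", "&quot;", "&#8222;"]


def describe(entity: str) -> str:
    """Decode one HTML entity and report the resulting character and code point."""
    char = html.unescape(entity)
    return f"{entity} -> {char} (U+{ord(char):04X})"


for sample in samples:
    print(describe(sample))
```

If the decoded character is the normal opening or closing quote for the target language, that suggests the entity is a deliberate quote mark rather than an encoding glitch.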

anon66243189 · September 18, 2021, 3:40am

In your screenshot the entities are in both original and translation
Thus all good

That’s not the case for hundreds of strings where only proposed translations have the entities

I’ll add a screenshot ASAP, it happens in all languages and is visible by filtering “space” issues, it’ll jump to the eye immediately

On mobile right now and will add a screen + link ASAP

anon66243189 · September 18, 2021, 4:17am

OK so example:
wp-admin/nav-menus.php:546:
$editing_menus .= '<ul><li>' . __( 'Add one or several items at once by selecting the checkbox next to each item and clicking Add to Menu' ) . '</li>';

There are no weird entities in this string at all.

Now check crowdin for that string in German language:

The punctuation warning is OK, the “Translation does not have space after ;” is not OK and is due to the unexpected „ appearing in the translation but not in the source
This issue repeats for hundreds of strings.
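Since this affects hundreds of strings, a small script could list the affected ones instead of clicking through the filter. This is a rough Python sketch, not part of Crowdin: `extra_entities` is a hypothetical helper, and the German string in the usage line is made up for illustration, not the real proposed translation.

```python
import re
from collections import Counter

# Matches named (&bdquo;), decimal (&#8222;) and hex (&#x201E;) entities.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def extra_entities(source: str, translation: str) -> list[str]:
    """Return entities that occur more often in the translation than in the source."""
    in_source = Counter(ENTITY_RE.findall(source))
    in_translation = Counter(ENTITY_RE.findall(translation))
    # Counter subtraction keeps only positive differences.
    return sorted((in_translation - in_source).elements())


# Made-up translation, for illustration only.
print(extra_entities(
    "clicking Add to Menu",
    "auf &bdquo;Zum Menü hinzufügen&ldquo; klicken",
))
```

Run over all source/translation pairs of an exported PO file, any non-empty result marks a string that needs a decision under whatever rule gets agreed on.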

The right thing is to have apostrophes here, because the string should read 'Add one or several items at once by selecting the checkbox next to each item and clicking "Add to Menu"', to denote “Add to Menu”, but that is not the case in the source and thus there is no reason at all for them to appear in a proposed translation.

The issue is not limited to German.

The only thing I can imagine is that the original translation PO that was uploaded does have those apostrophes, but they get badly formatted by Crowdin (since we probably shouldn’t encode them?)
I am not sure.

What I am sure of is that we need to document an approach to be taken for those, and then apply the same rule in all cases.
IMO, if the original has no apostrophe the translation should not either

With the exception of things like “L’immagine” or such things, but that is also not a cool way to write Italian; it’s strictly speaking lazy grammar, and proper school grammar is “La immagine”.
Only that no one actually uses the long form… so again here we need a rule as to how those apostrophes must be written: encoded, or written out plain?

PS:
<q> or any other HTML in general should not work in a translation if properly escaped.
So unless the original source has HTML in it, we should never add HTML to a translation. It’s the whole point of esc_html_e or esc_html__ to escape those tags, and it is (or should be?) widely used on localised strings.
HTML should be localised only if there is no other way.
I know that core does literally not care about this and does often not even escape strings that are localised, but dare you deliver a plugin with such string!
So we should probably just not do it in core either, being it the “be an example” tool/source and all that
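To illustrate why markup added in a translation stops working once the string is escaped, here is a small Python sketch. It uses Python's `html.escape` as a stand-in for the PHP helpers mentioned above, which is an assumption made for illustration only: the two are not identical, and `html.escape` also re-encodes the `&` of an existing entity, which may differ from what the PHP helpers do.

```python
import html


def escape_for_html(text: str) -> str:
    """Replace &, <, > and quotes so the text is shown literally, not parsed as HTML."""
    return html.escape(text)


print(escape_for_html("this is <q>some</q> text"))
# this is &lt;q&gt;some&lt;/q&gt; text

print(escape_for_html("&bdquo;"))
# &amp;bdquo;
```

The first line shows the `<q>` tag turning into visible text instead of producing quotes. The second shows why it matters whether a translation stores an entity or the plain character once escaping is involved.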

anon66243189 · September 20, 2021, 11:59am

@James or @ElisabettaCarrara do you have any suggestions here?

These translations are basically blocked until we figure this out, in all languages that somehow use apostrophes or else special “signs”

Cheers!

ElisabettaCarrara · September 20, 2021, 12:47pm

Usually in a CAT tool, while translating multiple versions, you would allow all the tags that differ from the original by ignoring the warning.
This is because, for example, THE CAR is L’AUTOMOBILE in Italian, so basically the issue is that the CAT tool is not expecting a tag in the Italian locale versus the English original.
There are also times when a tag is present in the original but not in the localized version, or times when both have tags but they do not match.
One problem I see with this is that, being Italian, I can’t discern whether a tag issue in Japanese is an “ignore” or a real issue.

anon66243189 · September 20, 2021, 1:14pm

And we’d use actual apostrophe or the encoded thing?

james · September 20, 2021, 5:22pm

This also happens but it is not the same as the issue that is being discussed here.

Yep, I agree. I think we should log the issue with the source string as a comment at Translations: Issues with source strings · Issue #109 · ClassicPress/ClassicPress-v1 · GitHub (we will probably want a better way to track this, but one issue per string seems like it will get out of hand fast).

Then, regarding what to do about the translation, I think I could argue it either way. The translation could be edited to strictly match the source string and remove the warning. However, the translation is actually an improvement over the source string, so it could be left alone and approved as-is.

So maybe that’s the rule to adopt: log any issues with the source strings, and if the translation is definitely an improvement over the source string then let it go.

The string in the original .pot file is reflected accurately: https://github.com/ClassicPress/i18n-core-crowdin/blob/5507797f11b4f92f08f594e7845a77e4f72950a9/en_US.pot#L12936-L12938

Also, HTML entities definitely should be encoded (as they are today) for core translations. This avoids any possible character set issues. We generally do not escape translations in core and we should not try to start until maybe one day when we have a mature translation infrastructure that is equipped to handle this issue properly.

This would only be a legitimate security issue if it were somehow possible for bad stuff to get into translations. At the moment it’s much easier to just make sure that doesn’t happen than to rewrite all the places where entities and tags are used in translated strings, so we will instead need to create a process that rejects translations containing anything unexpected, and the Crowdin warnings + required approval for new translations will play a part there.
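One piece of such a process could be an automated check that flags any translation containing HTML tags the source string does not have. This is only a sketch under assumptions: `unexpected_tags` and `review` are hypothetical helpers, the regular expression is a rough tag matcher rather than a real HTML parser, and the example strings are made up.

```python
import re

# Rough matcher for opening and closing tags; captures the tag name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def unexpected_tags(source: str, translation: str) -> set[str]:
    """Return tag names used in the translation but absent from the source."""
    allowed = {name.lower() for name in TAG_RE.findall(source)}
    found = {name.lower() for name in TAG_RE.findall(translation)}
    return found - allowed


def review(source: str, translation: str) -> str:
    """Return 'reject' if the translation introduces new tags, else 'ok'."""
    return "reject" if unexpected_tags(source, translation) else "ok"


# Made-up strings, for illustration only.
print(review("<strong>Error</strong>: try again",
             "<strong>Fehler</strong>: erneut versuchen"))
print(review("<strong>Error</strong>: try again",
             "<strong>Fehler</strong>: <script>x</script>"))
```

A check like this would not replace the Crowdin warnings or manual approval; it would only narrow down which strings a reviewer has to look at.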

StonehengeCreations · April 15, 2024, 12:43pm

I thought hard about reviving this topic, but I decided to do so, because I also run into the same issues.

My personal feeling is that the original core strings very often lack readability because they do not quote the text when referring to a link or button.

For example:
“The View Post link leads to that post on your live site.” would translate into Dutch like: “De link Bericht bekijken leidt naar dat bericht op je live site.”

Honestly, something like De link "Bericht bekijken"... would make it much easier for users to read and understand such explanations.

It is even worse if you look at the complete string of this specific example:
In the In response to column, there are three elements. The text is the name of the post that inspired the comment, and links to the post editor for that entry. The View Post link leads to that post on your live site. The small bubble with the number in it shows the number of approved comments that post has received. If there are pending comments, a red notification circle with the number of pending comments is displayed. Clicking the notification circle will filter the comments screen to show only pending comments on that post.

ElisabettaCarrara · April 15, 2024, 1:59pm

I am aware, the quotes might not be used for specific reasons:

one - escaping them in code and the fact that in PHP " " and ’ ’ have their own meaning.

two - the TAGS in CAT tools sometimes misinterpret them

three - the strings date back to when WP was born. In the early days, independent devs would localize their own locale and make it available in forums months after a WP release. Each of them had their own way of managing strings and there was no uniformity of opinion on the matter you are highlighting now, which meant that the problem was ignored. Then WP decided to centralize localization without resolving these issues, just going with what was there. Now we have inherited the hot mess and it is an occasion to do better.

So YES. We can definitely clean up how strings are organized, served, formatted and included in core. One issue at a time. That is why we need to release the first translations to gain awareness of the areas we need to work on to improve the whole system.

Then our team can decide on a plan and execute. Rome wasn’t built in a day, but we will get there.

Thanks for being so attentive and pointing out all the issues. I am keeping tabs and making a list that we will then need to discuss as a team with all the other people helping in localization. We will need to establish priorities (to me the formatting issue takes priority after the first translations release, but it’s just me) and see if the other contributors can help in making the change happen (because changing things like formatting would surely mean going into the files and going over each and every string, which is a huge undertaking).

StonehengeCreations · April 15, 2024, 3:25pm

YEEY for ClassicPress!

I have no idea what “TAGS in CAT tools” is, so I cannot comment on that.
But I do know that using single quotes in mo files (the translated strings) does not break the code as current WP translations contain them without any problems. Using single (or double) quotes in the source code is something quite different, of course…
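For reference, quotes inside a translation file are just data: in the PO format the string is wrapped in double quotes, so only a double quote (and a backslash) inside it needs a backslash escape, while single quotes are stored as they are. Here is a minimal Python sketch of that escaping; `po_quote` is a hypothetical helper and does not cover every escape the format allows.

```python
def po_quote(text: str) -> str:
    """Wrap text as a PO string literal, escaping backslash, double quote and newline."""
    escaped = (
        text.replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


print("msgstr " + po_quote('De link "Bericht bekijken"'))
print("msgstr " + po_quote("L'immagine"))
```

Tools such as Poedit or Crowdin handle this escaping on export, so translators normally type the plain quote characters.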

Personally I will pause translating V2 into Dutch, for now, as the pot file is very likely to be split up into three files anyway. I will kindly wait until the main road is clear.

ElisabettaCarrara · April 15, 2024, 3:46pm

About Tags, in Crowdin or any other computer assisted translation tool (OmegaT, MateCat, Poedit and all the others…) whenever there is something “notable” like formatting (bold for example), or hardcoded strings, links, HTML entities, special symbols or things like that a couple of “tags” (very similar to HTML that has TAGS) are used to encase it (usually they contain info on what is within them and their specific format tells the CAT tool that this specific content has to be preserved in its original format so that when the translated strings get to where they need to go they retain the info their format brings with them, so bold remains bold etc).

ClassicPress inherited a STRING MESS because the i18n WP team is not led by people who really know CAT tools or know localization processes. Do not get me wrong: a translator knows more of the ins and outs of how to localize than a dev, and to localize software both a dev and a translator are needed. Some devs know both localizing and translating and some translators can code, but the majority of people contributing to WP aren’t devs or translators unfortunately; they are just end users with no prior experience. It is a “community effort” where everybody is welcome on the WeGlot Platform and the team leads do not have the knowledge or access to normalize strings and solve the issues. This means each string needs to be carefully revised in their system at least four times by four different people with different access, and the problems continue to be there; they just ignore them and do the best they can with what they have.

Me thinks there is no need for you to hold back translating because me thinks it is possible to automate the splitting (I am looking into it with Crowdin help docs) - so that we translate one file and then we have an automatic process splitting it the way we need before release.

I will update you if this is easily possible, because it could be a way for us to simplify localizing without touching core (because the released locales would be served the old way to it)

ElisabettaCarrara · April 15, 2024, 3:57pm

Source files in your Crowdin project will be up-to-date with the selected branches in your repository. Ready translations will be automatically pushed as a pull request to your GitHub repository. So, make sure you have sufficient permissions in GitHub to set up this integration.

Tips for best practices:

Translations aren’t committed directly to the specified branches. They are pushed to the automatically created service branch - l10n. With this approach, you can review translations before merging them to the target branch and ensure that translations won’t be merged without your prior approval.
Translations from GitHub are pulled when you set up the integration. After, it is expected that all translations will be made in Crowdin, so we won’t check for new translations made in GitHub.
We recommend deleting the l10n branch when merging translations to avoid possible future conflicts. Once new translations are made in your Crowdin project, a service branch will be automatically created again.

For context: Crowdin pushes the locales to the automatic i18n branch as pull requests to be revised and merged. After all the locales are merged and released, that branch must be deleted, and it gets automatically recreated when there are new translations (and, as explained, we can set it to source new strings directly and to push the PRs to that repo). So IMHO we can set things up to merge them and perform the splitting in the develop branch PRIOR TO RELEASING, so there is no need to split them in Crowdin.

@MattyRob As of now there is one CP repo connected, so I think we are good to go with this process. Do you think this workflow could work?

MattyRob · April 15, 2024, 6:58pm

@ElisabettaCarrara, I’ve been looking at a process today to split and automate creating three POT files to match what is needed by the core code.

My belief is if we change the GitHub repository and re-sync it any current translations would be lost, but could be recovered by uploading a PO file already created and downloaded from Crowdin - is that correct? If so, then I think translation work can proceed.

ElisabettaCarrara · April 15, 2024, 8:21pm

Your belief is correct: we are going to lose the localizations, and simply re-uploading the locales will solve the issue.

What we need on the GitHub side is a connection to a repo where Crowdin can take the source strings and then create an automatic branch for i18n where the translated ones are pushed as PRs.

We would need one for CP v1 and one for CP v2, I think, since if I understand correctly they live in two different repos. Then I will be able to set the TMs (translation memories) that Crowdin uses to pretranslate correctly, since the one for v1 goes with the v1 repo and the one for v2 with the v2 repo.

Then Crowdin assumes that the repo where the software lives is the source (so that if new strings are added or strings are removed or changed only those changes get ported in Crowdin and all localizers get a notification that changes need to be translated/revised/approved, and the release that is generated pertains only to the changes made).

A notable thing to remember: Crowdin creates the PRs in an i18n branch. When you decide to merge translations into develop to do a release for CP, you have to destroy that branch, and Crowdin recreates it when needed. That is necessary to avoid confusion.