How to deal with weird things in translations

In Crowdin there are sometimes strange things in the translation

Example

Original text:
this is some text

Translation:
this is „some„ text

I am aware that „ stands for quotes, but, since
A) those quotes are not in the original text, and
B) they should likely be real quotes ""
…how to deal with this?
Remove?
Change to ""?
Keep using „ even if it is not in the original text, and even if it is not a proper quote?

I don’t think the entity is the right approach, especially if it was added by the tool.
To get the correct quotes for the language, you can use <q>some</q> but it’s best to keep markup out of translations, and a lot of theme authors bork the quotes with CSS (not knowing what they are doing).
There could be problems with having quotes in a translation: not just when entering them in the tool, but also when the tool stores them (or is it a flat file?). And when the string is displayed, hopefully it's escaped if it is used in an attribute.

Basically, I don’t know and I would ask someone knowledgeable over at WP. Perhaps they have a doc?

It’s a Crowdin “tool” issue, so I can’t really ask at WP :persevere:

It’s an error or a speciality of the tool we use… let’s hope someone has some insight into this, or we’ll have to try it and see “what happens when”
:crossed_fingers:

That is not an error. It’s a behaviour.
Crowdin is not a “translator” like Google Translate.
It’s a CAT tool (computer assisted translation).
It makes use of data from various sources to put together the best suggested translation to be checked and revised.
What you have seen is probably a formatting tag.
(CAT tools use tags to render certain formatting options, links and other things).
Usually tags come in “pairs” (opening and closing with text in the middle) but in certain cases they come as a single tag included in a line of text.
Another thing that may be happening (and that also happens on the translate WP sites) is a different character encoding. That case should be dealt with the same way as on the translate WP sites.
To say which is the case, I need to see the culprit line to be sure of what the tool is doing there. Generally tags are highlighted, but sometimes they are broken and shown as plain text with no highlight.

I posted the precise text and code that is shown above in the message

It’s clear that it is a “” quote character
But it is missing in the source text

So it’s not clear to me what should happen, what it is, or why it is even there
I’ve worked with CAT tools before and they generally don’t “add things where there aren’t any things”

For examples, you can query any language in crowdin and filter by “space issues”
All the issues flagged there that complain about the space after a “;” (or the lack of one) are affected by this
It’s at least a couple hundred strings

The main question is… leave the encoded part, use “”, or use nothing instead?

Do you have a screenshot of this string with its proposed translation? If you go into Proofreading mode then you get a nice display that also shows the place where it appears in the core code, which can sometimes be illuminating for unexpected translations.

Unless there is some reason for the quotes being different between the source and the translation, then I agree it’s probably best to remove the quotes from the translation. The most likely explanation is a translation error that we inherited from WP, and this is exactly the kind of thing that the process described at Approving and cleaning existing strings on Crowdin is designed to find and fix.

Otherwise the fact that HTML entities appear in translations is often normal and correct. Here’s an example of this being done correctly in Crowdin:

Finally if translating into a language that uses a different style of quotes then that should be reflected directly in the translated string - no need to use a hack like <q> for that.

In your screenshot the entities are in both the original and the translation
Thus all good

That’s not the case for hundreds of strings where only proposed translations have the entities

It happens in all languages and is visible by filtering for “space” issues; it jumps out immediately

I’m on mobile right now and will add a screenshot + link ASAP

OK so example:
wp-admin/nav-menus.php:546:
$editing_menus .= '<ul><li>' . __( 'Add one or several items at once by <strong>selecting the checkbox next to each item and clicking Add to Menu</strong>' ) . '</li>';

There are no weird entities in this string at all.

Now check crowdin for that string in German language:

The punctuation warning is OK. The “Translation does not have space after ;” warning is not OK: it is caused by the unexpected &#8222; that appears in the translation but not in the source.
This issue repeats for hundreds of strings.

The right thing is to have quotes here, because the string should read 'Add one or several items at once by <strong>selecting the checkbox next to each item and clicking "Add to Menu"</strong>', to set off the “Add to Menu”. But that is not the case in the source, and thus there is no reason at all for them to appear in a proposed translation.

The issue is not limited to German.
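
To make the mismatch above concrete, here is a rough Python sketch that lists HTML entities present in a translation but absent from its source string, together with the character each entity decodes to. The function name `unexpected_entities` is made up for this example, and the German string is a hypothetical illustration, not the actual Crowdin suggestion. It only uses the standard `re` and `html` modules and does not talk to Crowdin at all.

```python
import html
import re

# Matches numeric (&#8222;, &#x201E;) and named (&quot;) HTML entities.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);")


def unexpected_entities(source, translation):
    """Return (entity, decoded character) pairs for entities that occur
    in the translation but nowhere in the source string."""
    in_source = set(ENTITY_RE.findall(source))
    seen = set()
    extra = []
    for entity in ENTITY_RE.findall(translation):
        if entity not in in_source and entity not in seen:
            seen.add(entity)
            extra.append((entity, html.unescape(entity)))
    return extra


# Source string from wp-admin/nav-menus.php as quoted in this thread.
source = (
    "Add one or several items at once by <strong>selecting the checkbox "
    "next to each item and clicking Add to Menu</strong>"
)

# Hypothetical German translation, written only for this illustration.
translation = (
    "Füge einen oder mehrere Einträge hinzu, indem du <strong>das "
    "Kontrollkästchen neben jedem Eintrag auswählst und auf "
    "&#8222;Zum Menü hinzufügen&#8220; klickst</strong>"
)

for entity, char in unexpected_entities(source, translation):
    print(entity, "decodes to", char)
```

A check like this only reports the difference. Whether the extra quotes get removed or kept is still the rule that has to be agreed on.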

The only thing I can imagine is that the translation PO that was originally uploaded does have those apostrophes, but they get badly formatted by Crowdin (since we probably shouldn’t encode them?)
I am not sure.

What I am sure of is that we need to document an approach for these cases, and then apply the same rule every time.
IMO, if the original has no apostrophe, the translation should not have one either

With the exception of things like “L’immagine”, but that is also not a cool way to write Italian; it’s strictly speaking lazy grammar, and proper school grammar is “La immagine”.
Only, no one actually uses the long form… so here again we need a rule on how those apostrophes must be handled: encoded, or written out plain?

PS:
<q> or any other HTML in general should not work in a translation if properly escaped.
So unless the original source has HTML in it, we should never add HTML to a translation. The whole point of esc_html_e or esc_html__ is to neutralise such tags (they are escaped and output as literal text), and these functions are (or should be?) widely used on localised strings.
HTML should be localised only if there is no other way.
I know that core literally does not care about this and often does not even escape localised strings, but dare you deliver a plugin with such a string!
So we should probably just not do it in core either, core being the “be an example” tool/source and all that

@James or @ElisabettaCarrara do you have any suggestions here?

These translations are basically blocked until we figure this out, in all languages that use apostrophes or other special “signs”

Cheers!

Usually in a CAT tool, when translating multiple versions, you would allow all the tags that differ from the original by ignoring the warning.
This is because, for example, THE CAR is L’AUTOMOBILE in Italian, so basically the issue is that the CAT tool is not expecting a tag in the Italian locale that is absent from the English original.
There are also times when a tag is present in the original but not in the localized version, or when both have tags but they do not match.
One problem I see with this is that, being Italian, I can’t tell whether a tag issue in Japanese is an “ignore” or a real issue.

And would we use an actual apostrophe or the encoded entity?

This also happens but it is not the same as the issue that is being discussed here.

Yep, I agree. I think we should log the issue with the source string as a comment at https://github.com/ClassicPress/ClassicPress/issues/569 (we will probably want a better way to track this, but one issue per string seems like it will get out of hand fast).

Then, regarding what to do about the translation, I think I could argue it either way. The translation could be edited to strictly match the source string and remove the warning. However, the translation is actually an improvement over the source string, so it could be left alone and approved as-is.

So maybe that’s the rule to adopt: log any issues with the source strings, and if the translation is definitely an improvement over the source string then let it go.

The string in the original .pot file is reflected accurately: https://github.com/ClassicPress/i18n-core-crowdin/blob/5507797f11b4f92f08f594e7845a77e4f72950a9/en_US.pot#L12936-L12938

Also, HTML entities definitely should be encoded (as they are today) for core translations. This avoids any possible character set issues. We generally do not escape translations in core and we should not try to start until maybe one day when we have a mature translation infrastructure that is equipped to handle this issue properly.
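
As a small illustration of why encoded entities sidestep character set issues, the Python sketch below turns every non-ASCII character into a numeric HTML entity, so the stored string is plain ASCII no matter how the file is later read. The helper name `to_numeric_entities` is invented for this example, and this is not how Crowdin or core produces the strings; it only shows the relationship between `„` and `&#8222;`.

```python
import html


def to_numeric_entities(text):
    """Replace every non-ASCII character with a numeric HTML entity,
    leaving plain ASCII characters untouched."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


quoted = "\u201eAdd to Menu\u201c"  # German-style low/high quotes
encoded = to_numeric_entities(quoted)

print(encoded)
# Decoding the entities gives back the original characters.
print(html.unescape(encoded) == quoted)
```

Note that this only covers the quote characters themselves. Characters such as `<`, `>` and `&` are ASCII and pass through unchanged, so this is not a substitute for escaping.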

This would only be a legitimate security issue if it were somehow possible for bad stuff to get into translations. At the moment it’s much easier to just make sure that doesn’t happen than to rewrite all the places where entities and tags are used in translated strings, so we will instead need to create a process that rejects translations containing anything unexpected, and the Crowdin warnings + required approval for new translations will play a part there.
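
One possible shape for the "reject anything unexpected" part, sketched in Python under the assumption that source and translation are available as plain strings: compare the HTML tags in both and report any tag the translation has more of than the source. The names `tag_counts` and `unexpected_tags` are hypothetical, the regex is deliberately simple, and the real process would still rely on the Crowdin warnings and human approval.

```python
import re
from collections import Counter

# Captures an optional leading slash and the tag name, e.g. ("", "strong")
# for <strong> and ("/", "strong") for </strong>. Attributes are ignored.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def tag_counts(text):
    """Count opening and closing tags by lower-cased name."""
    return Counter(slash + name.lower() for slash, name in TAG_RE.findall(text))


def unexpected_tags(source, translation):
    """Return the sorted tags that the translation contains more often
    than the source does."""
    extra = tag_counts(translation) - tag_counts(source)
    return sorted(extra)


# Tags match: nothing to report.
print(unexpected_tags("Click <strong>Save</strong>",
                      "Klicke auf <strong>Speichern</strong>"))

# The translation adds <q>, which the source does not have.
print(unexpected_tags("this is some text", "das ist <q>etwas</q> Text"))
```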
