Discussion:
wide character filenames support
e***@public.gmane.org
2010-11-17 16:30:49 UTC
Hi,

I have discovered a new bug in Kiwix for Windows:
Kiwix is not able to open filenames containing accents, something
containing "wikipédia" for example.

This is a Windows-specific issue; if I do the same on GNU/Linux, it
works.

I think the explanation is that ext4 uses UTF-8 as its charset and NTFS
uses UTF-16... and a code point in UTF-16 needs 2 bytes whereas in UTF-8
it needs only one.
As the method to open a ZIM file takes a char* as argument (so based on
1-byte units), it works with UTF-8.

This is a fairly new problem for me and I am simply trying to understand
what needs to be fixed.
So feel free to give feedback here.

For example, it is not clear whether we absolutely need an (additional)
method accepting wide chars to open such files in the zimlib.
It is also not clear to me whether it is possible to build a generic
and portable solution here.

Regards
Emmanuel
Asaf Bartov
2010-11-22 13:25:39 UTC
Hi.

Note that the bug exists in GNU/Linux as well -- it's just better hidden...
:)
UTF-8 uses a _variable_ number of bytes to encode a code point. Often a
single byte is enough. But if your filename includes very special
characters, such as an "em-dash" (–) or an IPA character such as *ʧ* --
then the character would take up two bytes, and for some obscure characters
it can be up to _four_ bytes.

So French accents fit in one byte, but some other characters do not. If I
had a ZIM file with such a character on GNU/Linux, the code would fail too.
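
The exact byte count depends on the code point, so it is worth checking rather than guessing. Below is a minimal sketch, assuming only the standard UTF-8 ranges; the helper name `utf8_length` is made up for illustration and is not part of zimlib or Kiwix. Under these ranges, é (U+00E9) and ʧ (U+02A7) each take two bytes, and – (U+2013) takes three.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of zimlib): number of bytes UTF-8 needs
// to encode one Unicode code point.
// Returns 0 for values that cannot be encoded (surrogates, > U+10FFFF).
constexpr std::size_t utf8_length(std::uint32_t cp) {
    return (cp >= 0xD800 && cp <= 0xDFFF) ? 0   // UTF-16 surrogate range
         : cp < 0x80                      ? 1   // ASCII
         : cp < 0x800                     ? 2   // e.g. U+00E9, U+02A7
         : cp < 0x10000                   ? 3   // e.g. U+2013
         : cp < 0x110000                  ? 4   // beyond the BMP
         : 0;                                   // outside Unicode
}
```

For example, `utf8_length(0x41)` gives 1 for the letter A, while `utf8_length(0xE9)` gives 2.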

We do need a portable solution. I don't know the right way to do it off
the top of my head, so perhaps someone else on the list can offer advice.
If no one can, I'm willing to figure it out myself.

Asaf
--
Asaf Bartov <asaf.bartov-***@public.gmane.org>
Emmanuel Engelhart
2010-11-22 19:49:08 UTC
Post by Asaf Bartov
Note that the bug exists in GNU/Linux as well -- it's just better hidden...
:)
UTF-8 uses a _variable_ number of bytes to encode a code point. Often a
single byte is enough. But if your filename includes very special
characters, such as an "em-dash" (–) or an IPA character such as *ʧ* --
then the character would take up two bytes, and for some obscure characters
it can be up to _four_ bytes.
I think there is no issue with UTF-8, either in libzim or in Kiwix,
for file names with an em-dash. I have tested it and it works. The
reason, I think, is that the kernel interprets the char* string directly
as UTF-8 (ext3/4 is in UTF-8).

But on Windows, it is not possible to interpret the char* directly as
UTF-16; otherwise an ASCII-encoded path would not work. So I
suppose STL open() & co (or the kernel) perform a charset conversion to
UTF-16 before asking the filesystem.

So if you want to open a file whose name contains characters outside the
ASCII charset, I suppose you have to use a special STL open() accepting
wchar and give the path directly in UTF-16.

That is my theory.
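
If that theory holds, one possible approach is to keep UTF-8 in the char* interface and convert to UTF-16 only at the point where a wide-character open is called. The following is a rough sketch of such a conversion, not what zimlib actually does; the function name `utf8_to_utf16` is hypothetical, and the decoder checks lead and continuation bytes but does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of zimlib): decode a UTF-8 byte string
// and re-encode it as UTF-16 code units.
// Throws std::invalid_argument on truncated or malformed input.
// Note: overlong encodings are not rejected in this sketch.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::invalid_argument("invalid code point");
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied into a std::wstring and handed to a wide-character call such as `_wfopen`; on GNU/Linux the original char* would simply be passed through unchanged. Whether zimlib should hide this behind a single open method is exactly the open question here.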
Post by Asaf Bartov
So French accents fit in one byte, but some other characters do not. If I
had a ZIM file with such a character on GNU/Linux, the code would fail too.
It does not look like it :)
Post by Asaf Bartov
We do need a portable solution. I don't know the right way to do it off
the top of my head, so perhaps someone else on the list can offer advice.
If no one can, I'm willing to figure it out myself.
Yes, that would be great. Tommi, you are the STL expert :)

Thanks for your feedback Asaf.
Emmanuel
