Unicode NFC normalisation for Rclone on macOS

Apple devices create all filenames in Unicode Decomposed Normalisation Form (NFD), while every other major OS uses Composed Normalisation Form (NFC). This makes you, as a Mac user, the bad guy, because it is you who is incompatible with the rest of the world.

In a nutshell, the problem is this: whenever you create files with diacritics in their names, they are copied to other devices with the filenames stored as decomposed strings. This is nonstandard for those OSes, and you never know what problems it will cause.
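
To make the difference concrete, here is a minimal sketch that prints the UTF-8 bytes of the letter á in both forms. The variable and function names are illustrative choices, not part of any tool discussed here; the octal escapes spell out U+00E1 (composed) and U+0061 followed by the combining acute accent U+0301 (decomposed).

```shell
#!/bin/sh
# "á" in composed form (NFC): the single code point U+00E1.
nfc=$(printf '\303\241')
# "á" in decomposed form (NFD): "a" (U+0061) + combining acute accent (U+0301).
nfd=$(printf 'a\314\201')

# Print the raw bytes of a string as space-separated hex.
hex_bytes() {
    printf '%s' "$1" | od -An -v -tx1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

echo "NFC: $(hex_bytes "$nfc")"   # prints: NFC: c3 a1
echo "NFD: $(hex_bytes "$nfd")"   # prints: NFD: 61 cc 81
```

Both strings render as the same letter, yet they are different byte sequences, which is all that a byte-oriented filesystem or tool ever sees.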

This article presents my way of solving the problem by configuring Rclone to create all files in NFC (composed form) instead of NFD (decomposed form) – which is not at all as straightforward as it would seem.

Direct way to solve the problem

TL;DR: If you just want to solve the problem without actually delving into the problem and its technical details, simply follow the steps below. Otherwise, head over to Technical background.

Prereqs: download Rclone and macFUSE.1)

  1. Download the custom-made iconv library – this is actually Apple's own version of the library (which you have on your macOS), but with the iconv base updated to the latest version and with support for surrogate pairs (this is actually important because, for example, most emoji lie outside the Basic Multilingual Plane and are encoded as surrogate pairs in UTF-16).
  2. Tweak this library yourself, because it is not fully compatible with Apple's version. Specifically, open the file ./include/iconv.h.in and comment out these 6 code blocks (I list the lines below already commented):
    1. lines 69–71:
      //#ifndef LIBICONV_PLUG
      //#define iconv_open libiconv_open
      //#endif
    2. lines 79–81:
      //#ifndef LIBICONV_PLUG
      //#define iconv libiconv
      //#endif
    3. lines 85–87:
      //#ifndef LIBICONV_PLUG
      //#define iconv_close libiconv_close
      //#endif
    4. line 129:
      //#define iconv_open_into libiconv_open_into
    5. line 134:
      //#define iconvctl libiconvctl
    6. line 214:
      //#define iconvlist libiconvlist
  3. Compile and install the double-tweaked library using the following commands:
    make -f Makefile.utf8mac autogen
    ./configure --prefix=/usr/local
    make
    make install

    These are given by the author of the tweaked version and are similar to the build commands of the original GNU libiconv. However, I added the --prefix=/usr/local parameter to the configure command (present in the GNU instructions, but not in those of the tweaked version), since I wasn't sure where the library would install itself without it.

    • Note that for the first build step, you need autoconf and automake (install them e.g. through MacPorts by running sudo port install autoconf automake)
  4. Now you have two Apple-style libiconv binaries on your system: the original (and old) Apple one in /usr/bin/iconv and the tweaked (and updated) one in /usr/local/bin/iconv.
    1. First, test that the new binary itself works with the following conversions:
      $ echo 🙂 | /usr/bin/iconv -f utf-8 -t utf-8-mac
      $ echo 🙂 | /usr/local/bin/iconv -f utf-8 -t utf-8-mac
      🙂
    2. Second, test that the new dynamic library has a proper symbol table. If everything went right, typing the following command:
      $ nm -gU /usr/local/lib/libiconv.2.dylib

      should output something similar to the first table below, and not to the second one (the order of the lines may vary; it is the symbol names that matter):

      Correct (Apple-style) symbol table

      00000000000e3290 D __libiconv_version
      0000000000002ce0 T _iconv
      0000000000003430 T _iconv_canonicalize
      0000000000002d10 T _iconv_close
      00000000000016b0 T _iconv_open
      0000000000002d20 T _iconv_open_into
      0000000000003160 T _iconvctl
      0000000000003270 T _iconvlist
      0000000000015eb0 T _libiconv_set_relocation_prefix
      0000000000015dd0 T _locale_charset

      Incorrect (GNU-style) symbol table

      00000000000e3290 D __libiconv_version
      0000000000002ce0 T _libiconv
      0000000000003430 T _iconv_canonicalize
      0000000000002d10 T _libiconv_close
      00000000000016b0 T _libiconv_open
      0000000000002d20 T _libiconv_open_into
      0000000000003160 T _libiconvctl
      0000000000003270 T _libiconvlist
      0000000000015eb0 T _libiconv_set_relocation_prefix
      0000000000015dd0 T _locale_charset
  5. Now you need to update the Fuse library so that it looks for the dynamically-loaded libiconv library in the new location:
    1. First, check that Fuse actually looks for the library under /usr/lib/:
      $ otool -L /usr/local/lib/libfuse.2.dylib
      /usr/local/lib/libfuse.2.dylib:
        /usr/local/lib/libfuse.2.dylib (compatibility version 12.0.0, current version 12.9.0)
        /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
        [other libraries]
    2. Now, change the path by using install_name_tool:
      sudo install_name_tool -change /usr/lib/libiconv.2.dylib /usr/local/lib/libiconv.2.dylib /usr/local/lib/libfuse.2.dylib
    3. Finally, check that the change is successful:
      $ otool -L /usr/local/lib/libfuse.2.dylib
      /usr/local/lib/libfuse.2.dylib:
        /usr/local/lib/libfuse.2.dylib (compatibility version 12.0.0, current version 12.9.0)
        /usr/local/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
        [other libraries]
  6. Done! From now on, whenever you create a new file or rename an existing one, Rclone will produce its filename in Unicode composed form.
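
The manual commenting-out in step 2 can also be scripted. The following is only a sketch: the function name is hypothetical, and the line numbers are simply the ones listed in step 2, so they are valid only for that exact revision of ./include/iconv.h.in and should be checked before relying on them.

```shell
#!/bin/sh
# Prefix the lines listed in step 2 with "//" and print the result to stdout.
# ASSUMPTION: the line numbers match your copy of include/iconv.h.in.
comment_out_iconv_defines() {
    sed -e '69,71s|^|//|' \
        -e '79,81s|^|//|' \
        -e '85,87s|^|//|' \
        -e '129s|^|//|' \
        -e '134s|^|//|' \
        -e '214s|^|//|' \
        "$1"
}

# Example (writes to a new file so that the original stays untouched):
#   comment_out_iconv_defines include/iconv.h.in > iconv.h.in.patched
```

Writing to a separate file rather than editing in place makes it easy to diff the result against the original and confirm that only the six intended blocks were touched.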

Technical background

Historical aspect and the situation now

Historically, the way of encoding filenames2) on Mac OS started to differ from other operating systems when Apple switched from the HFS to the HFS+ file system in 1998. HFS+ uses Unicode 3.2 to encode filenames, and Unicode allows four different ways of storing these filenames on disk – the so-called "normalisation forms".

But while HFS+ enforces canonical decomposed normalisation form (NFD) for filenames, virtually all other common filesystems treat filenames simply as sequences of bytes – this is true both for Linux distros (ext*, ReiserFS) and for Windows filesystems (FAT with Long filenames support, NTFS).

However, the standard filename-manipulation libraries on both Linux and Windows normalize filenames to the composed normalisation form (NFC), so while it is technically possible to create decomposed filenames on them, this does not usually happen unless the user specifically wants to do that and bypasses the standard OS routines. The same is true for all web technologies (CSS, HTML, XML), where W3C specifically recommends always using the composed form.3)

When Apple switched from HFS+ to the APFS file system in its devices back in 2017, the situation could have changed, since APFS adopts the same approach as other file systems and does not normalize filenames – instead, it treats them as a "bag of bytes". However, just as on Linux distros and Windows, the standard filename-manipulation libraries on macOS do normalize filenames – but to NFD, so exactly the opposite of these other OSes. So the current situation with Apple devices is that while the APFS file system itself is normalisation-agnostic, the underlying macOS routines create decomposed filenames by default.4)
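
To see which names in a directory are actually stored decomposed, a rough check is to look for combining marks in the raw bytes of each name. The sketch below uses hypothetical function names and only covers the Combining Diacritical Marks block (U+0300 to U+036F); other decomposable characters, such as Hangul syllables, are not detected.

```shell
#!/bin/sh
# Succeeds if the string contains a code point from U+0300..U+036F, i.e. the
# UTF-8 byte sequences CC 80..CC BF or CD 80..CD AF.
has_combining_mark() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' |
        grep -Eq '^([0-9a-f]{2})*(cc[89ab][0-9a-f]|cd[89a][0-9a-f])'
}

# Print the names of all entries in a directory that look decomposed.
list_decomposed() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue
        name=${path##*/}
        if has_combining_mark "$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example: list_decomposed "$HOME/Documents"
```

The regular expression is anchored to whole hex pairs so that a match can only start at a byte boundary.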

Problems raised by the difference in normalisation forms

This incompatibility of Unicode normalisation forms between Apple and other devices creates three different problems:

  1. First, when users on other OSes rename the files you created (or renamed), they need to press Backspace twice to remove a single letter with diacritics (e.g. á or ü). Renaming a file whose name contains a lot of diacritics can become a pretty lengthy process. And note that this also applies to you when you access these clouds from a web client (i.e. your browser), since the recommended standard for web technologies is NFC, as proposed by W3C.
  2. Second, and for the same reason, files with names encoded in NFD are often impossible to find in various web-based projects, since many libraries do not work properly with NFD, as it is de facto a nonstandard encoding. In short, anything you upload from an Apple device to the web will likely cause some kind of problem.
  3. Third, composed and decomposed sequences of characters do not look exactly the same. I do not know if this is related to particular fonts or whether it is a rendering issue of a browser, but the fact is that composed and decomposed characters with combining marks simply look different and the decomposed ones always look worse.
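
The first two problems can be reproduced with nothing more than standard shell tools. This is an illustrative sketch with made-up names: it counts code points (every byte that is not a UTF-8 continuation byte) and then searches a decomposed name for its composed spelling.

```shell
#!/bin/sh
# The same visible name "á.txt", composed (NFC) and decomposed (NFD).
nfc_name=$(printf '\303\241.txt')
nfd_name=$(printf 'a\314\201.txt')

# Count Unicode code points by counting every byte that is not a UTF-8
# continuation byte (0x80..0xBF).
count_code_points() {
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' |
        grep -c '^[0-7c-f][0-9a-f]$' || true
}

echo "NFC code points: $(count_code_points "$nfc_name")"   # prints 5
echo "NFD code points: $(count_code_points "$nfd_name")"   # prints 6

# A byte-wise search for the NFC spelling does not find the NFD spelling.
matches=$(printf '%s\n' "$nfd_name" | grep -cF -- "$nfc_name" || true)
echo "NFC search hits in NFD listing: $matches"            # prints 0
```

The extra code point in the NFD name is the separate combining accent, which is what the second Backspace removes; the zero search hits show why a tool comparing raw bytes cannot find the file.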

Solution, part one: using Rclone with iconv module

Rclone has a special switch -o (--option), which forwards its parameters to the underlying macFUSE or FUSE-T system that provides the mounting functionality for the remote system. This way, it is possible to tell Fuse to load the iconv module and have it automatically convert all filenames to NFC when they are moved to the remote cloud.

This is achieved by converting the filenames to UTF-8 from the special UTF-8-MAC encoding, which is basically UTF-8 in decomposed form, but with some specifics of macOS's version of UTF-8 decomposition. Thus, the straightforward way to solve the problem when mounting the system with Rclone should be to use the following command:

$ rclone mount [] -o modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC

This is what the creator of Rclone himself recommends, and it kind of works, but with problems.
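
For orientation, a complete invocation might be assembled as in the sketch below. The remote name and the mount point are placeholders invented for this example (they are not taken from the command above), and the script only prints the command so that it can be reviewed before being run.

```shell
#!/bin/sh
# HYPOTHETICAL values for illustration only; substitute your own.
remote="myremote:"           # assumed name of a configured rclone remote
mountpoint="/tmp/myremote"   # assumed local mount point

# The Fuse options from the command above, passed through rclone's -o switch.
fuse_opts="modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC"

# Assemble the command and print it instead of executing it.
mount_cmd="rclone mount $remote $mountpoint -o $fuse_opts"
echo "$mount_cmd"
```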

Problem: Missing files when using iconv in Rclone

The first problem you will encounter immediately after adding the -o modules=iconv part is that some files go missing from the Rclone listing. This is because macOS uses its own “Apple-tweaked” implementation of iconv, which (1) is very old; (2) is nonstandard; and (3) cannot convert significant parts of Unicode – for example, the emoji 😢. The moment you use the -o modules=iconv parameter in the Rclone mount command, you bring this broken iconv implementation into play, and with it the aforementioned problems. All three of these problems will be crucial in our attempt to deal with the issue.

Third problem: Apple's version of iconv is bugged and incomplete

The third point is what actually causes the missing files: macOS's broken version of iconv simply cannot convert some parts of Unicode (most notably the surrogate pairs used e.g. in emoji), and thus any program using it will be missing files. As was suggested elsewhere,5) you can test this simply by trying to convert any emoji between the UTF-8 and UTF-8-MAC encodings:

$ echo 😀 | iconv -f UTF-8-MAC -t UTF-8
iconv: (stdin):1:0: cannot convert
$ echo 😀 | iconv -f UTF-8 -t UTF-8-MAC
�
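
The same test can be wrapped in a small helper to compare several iconv binaries at once. This is a sketch under the assumption that a working converter returns the emoji unchanged in both directions (an emoji has no decomposition, so NFC and NFD are identical for it); the function name is hypothetical.

```shell
#!/bin/sh
# Succeeds if the given iconv command converts an emoji between UTF-8 and
# UTF-8-MAC in both directions without failing or altering it.
emoji_roundtrip_ok() {
    iconv_cmd=$1
    emoji=$(printf '\360\237\230\200')   # U+1F600, UTF-8 bytes F0 9F 98 80
    out=$(printf '%s\n' "$emoji" | "$iconv_cmd" -f UTF-8 -t UTF-8-MAC 2>/dev/null) || return 1
    [ "$out" = "$emoji" ] || return 1
    out=$(printf '%s\n' "$emoji" | "$iconv_cmd" -f UTF-8-MAC -t UTF-8 2>/dev/null) || return 1
    [ "$out" = "$emoji" ]
}

# Compare the system binary with one installed under /usr/local.
for bin in /usr/bin/iconv /usr/local/bin/iconv; do
    if emoji_roundtrip_ok "$bin"; then
        echo "$bin: ok"
    else
        echo "$bin: FAILS"
    fi
done
```

A binary that is missing, that reports an error, or that prints a replacement character is reported as failing.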

First problem: Apple's version of iconv is very old

The first point is hard to solve because the executable itself resides in /usr/bin/iconv (and the library in /usr/lib/libiconv.dylib), which is protected by SIP (System Integrity Protection), so you cannot normally do anything with it, and the only way to update it is to update the whole of macOS.

However, updating macOS won't help you that much, since the standard library versions Apple provides are almost always badly outdated.6) In the case of iconv, the versions supplied with different macOS releases are these:

macOS version        | macOS release date | iconv version (Apple) | iconv version (library)  | iconv release date 7)
macOS Sequoia 15     | 2024-09-16         | libiconv-107/109 8)   | FreeBSD libiconv 1.11[?] | 2009-03-03
macOS Sonoma 14      | 2023-09-26         | libiconv-102          | FreeBSD libiconv 1.11[?] | 2009-03-03
macOS Ventura 13     | 2022-10-24         | libiconv-64           | GNU libiconv 1.11        | 2006-07-19
macOS Monterey 12    | 2021-10-25         | libiconv-61           | GNU libiconv 1.11        | 2006-07-19
macOS Big Sur 11     | 2020-11-17         | libiconv-59           | GNU libiconv 1.11        | 2006-07-19
macOS Catalina 10.15 | 2019-10-07         | libiconv-59           | GNU libiconv 1.11        | 2006-07-19
macOS Mojave 10.14   | 2018-09-24         | libiconv-51.200.6     | GNU libiconv 1.11        | 2006-07-19

So in a nutshell, despite what the internal Apple versioning says, all of these macOS releases still use libiconv 1.11, released back in 2006.

Second problem: Apple's version of iconv is non-standard

Lastly, the second point means that you cannot simply use the standard GNU libiconv distribution, because the standard version of the library differs from Apple's native one in two respects.

First, it does not have the special UTF-8-MAC encoding. This effectively means that we cannot use the standard iconv library for the UTF-8 ↔ UTF-8-MAC conversion which we wanted from the very beginning, so dropping the Apple version of the library defeats the purpose.

Second, the functions which the libiconv.2.dylib library exports are named differently, so they are incompatible with Apple's native version of the library. This can be seen by running the nm tool on the library (see its documentation), as in the symbol-table check in step 4 of the direct solution.
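
Telling the two naming schemes apart can be automated by looking at how the iconv_open entry point is exported. The helper below is a sketch with a hypothetical name; it reads nm output on standard input, so it can be tried on saved listings as well as on a live library.

```shell
#!/bin/sh
# Reads `nm -gU` output on stdin and reports which naming scheme the
# exported iconv entry points follow.
classify_iconv_symbols() {
    symbols=$(cat)
    if printf '%s\n' "$symbols" | grep -q ' T _libiconv_open$'; then
        echo "GNU-style"
    elif printf '%s\n' "$symbols" | grep -q ' T _iconv_open$'; then
        echo "Apple-style"
    else
        echo "unknown"
    fi
}

# Example:
#   nm -gU /usr/local/lib/libiconv.2.dylib | classify_iconv_symbols
```

Only a library reported as Apple-style can stand in for the system one, because programs linked against Apple's libiconv look up the unprefixed names.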

Solution, part two: updating iconv with UTF-8-MAC support

All this boils down to a solution summed up above in Direct way to solve the problem:

  1. We need an updated iconv library with the UTF-8-MAC encoding, but since there are no updates to the Apple library and the standard library does not have the UTF-8-MAC encoding, a hybrid library is needed.
  2. We need the other apps and executables to actually use our updated library, but since the system folders (/usr/bin/* and /usr/lib/*) are protected by SIP, we need to manually point these apps to the proper library version.

As for the first point, there is a patched iconv library by Fumiyas which combines Apple's iconv library supporting the UTF-8-MAC encoding with the latest iconv library. Installing it not only allows you to convert between real UTF-8 and UTF-8-MAC encodings, but also adds support for surrogate pairs. You have to build it yourself or download a prebuilt version from another user on GitHub.

As for the second point, this is solved by manually changing the iconv library path in the executable with the install_name_tool.
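
Whether a binary already points at the new library can be checked by extracting the libiconv entry from the otool listing. The helper name below is hypothetical; the sketch reads `otool -L` output on standard input and prints the libiconv path it finds, or nothing if libiconv is not referenced at all.

```shell
#!/bin/sh
# Reads `otool -L` output on stdin and prints the path of the libiconv
# library that the inspected binary is linked against.
linked_iconv_path() {
    awk '$1 ~ /\/libiconv\.2\.dylib$/ { print $1 }'
}

# Example:
#   otool -L /usr/local/lib/libfuse.2.dylib | linked_iconv_path
# After a successful install_name_tool change this should print
# /usr/local/lib/libiconv.2.dylib rather than /usr/lib/libiconv.2.dylib.
```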

1)
I have no idea whether this process works at all with FUSE-T, the kextless implementation of FUSE.
2)
filenames – that is, file or folder names. I will use the term “file” to mean any inode, whether it is a file or a directory.
3)
NFC, to be more specific.
4)
As with Windows/Linux distros, it is possible to bypass standard macOS routines, and thus to create composed filenames under macOS – for example under Terminal – but this is not common.
6)
The Apple Open Source site provides listings of all of the open source software included in each release, together with their versions (these sometimes have some weird Apple-specific versioning, but when you go to the respective GitHub page, you can usually dig out the actual software version there).
7)
See libiconv archive with downloads for each version.
8)
libiconv-109 since macOS Sequoia 15.2
blog/odborny/2024-09-22-unicode_nfc_normalisation_for_rclone_on_macos.txt · Last modified: 2025/04/16 19:55 by Róbert Toth