How a few bytes completely broke my production app


Over the last 9 years I’ve been working in my free time on a passion project called GDLauncher (https://gdlauncher.com), and today we’re gonna analyse a bug in our codebase that resulted in a crash for tens of thousands of users worldwide.

What is GDLauncher

Very briefly, GDLauncher is a custom Minecraft launcher written from scratch in Rust and SolidJS.

I started it as a fun side project in 2015 in C#, and it has been rewritten multiple times over the years in many different languages as a way for me to get better at programming. Since 2022 it’s become a profitable company and I’ve been working on it almost full-time.

The Problem

Every now and then I spend some time going over my Sentry issues. This one was interesting enough that I decided to write an article about it, so other people can learn from my mistake and avoid it in the future.

This is what I was presented with on the Sentry console. For anyone experienced with character encoding this will immediately ring a bell, but for everyone else, I’m gonna explain exactly what’s going on here.

This is happening when a user creates a new Minecraft instance on the app. An instance is basically just a folder that contains all the data for a specific Minecraft version, so you can have multiple Minecraft versions and modloaders installed at the same time without conflicts.

When clicking “Create”, JavaScript sends an IPC event to our Rust process, which handles it. But since the Rust process panics, the entire app crashes and stops working. That’s a pretty big deal, especially when this applies to many thousands of users.

Some context on the issue

When receiving the event, the Rust process immediately calls a function called “next_folder” which, given a starting name, finds a valid and sanitised name for the folder of the instance. If a folder with that exact name already exists, it appends a number (instance, instance_1). It also handles other edge cases, such as reserved filesystem names and illegal characters.

The instances filesystem tree would look something like

- instances
  - instance 1
    - instance.json (config file)
    - instance
      - ... minecraft files & directories
  - instance 2
    - instance.json (config file)
    - instance
      - ... minecraft files & directories
  - instance 3
    - instance.json (config file)
    - instance
      - ... minecraft files & directories

The folder name is not particularly important for the user as the actual instance name is stored in a config JSON file inside the folder itself, so the folder name is mainly a way for users to manually navigate through them.

Now, this is what our next_folder function looks like (you don’t really need to understand it).

async fn next_folder(self, name: &str) -> anyhow::Result<(String, PathBuf)> {
    if name.is_empty() {
        bail!("Attempted to find an instance directory name for an unnamed instance");
    }

    #[rustfmt::skip]
    const ILLEGAL_CHARS: &[char] = &[
        // linux / windows / macos
        '/',
        // macos / windows
        ':',
        // ntfs
        '\\', '<', '>', '*', '|', '"', '?',
        // FAT
        '^',
    ];

    #[rustfmt::skip]
    const ILLEGAL_NAMES: &[&str] = &[
        // windows
        "con", "prn", "aux", "clock$", "nul",
        "com1", "com2", "com3", "com4", "com5", "com6", "com7", "com8", "com9",
        "lpt1", "lpt2", "lpt3", "lpt4", "lpt5", "lpt6", "lpt7", "lpt8", "lpt9",
    ];

    // trim whitespace (including windows does not end with ' ' requirement)
    let name = name.trim();
    // max 28 character name. this gives us 3 digits for numbers to use as discriminators
    let name = &name[0..usize::min(name.len(), 28)];

    // sanitize any illegal filenames
    let mut name = match ILLEGAL_NAMES.contains(&(&name.to_lowercase() as &str)) {
        true => format!("_{name}"),
        false => name.to_string(),
    };

    // stop us from making hidden files on macos/linux ('~' disallowed for sanity)
    if name.starts_with('.') || name.starts_with('~') {
        name.replace_range(0..1, "_");
    }

    // '.' disallowed when ending filenames on windows ('~' disallowed for sanity)
    if name.ends_with('.') || name.ends_with('~') {
        name.replace_range(name.len() - 1..name.len(), "_");
    }

    let mut sanitized_name = name
        .chars()
        .map(|c| match ILLEGAL_CHARS.contains(&c) {
            true => '_',
            false => c,
        })
        .collect::<String>();

    let mut instance_path = self
        .app
        .settings_manager()
        .runtime_path
        .get_instances()
        .to_path();

    // can't conflict with anything if it doesn't exist
    if !instance_path.exists() {
        instance_path.push(&sanitized_name);
        return Ok((sanitized_name, instance_path));
    }

    if !instance_path.is_dir() {
        bail!("GDL instances path is not a directory. Please move the file blocking it.")
    }

    let base_length = sanitized_name.len();

    for i in 1..1000 {
        // at this point sanitized_name can't be '..' or '.' or have any other escapes in it
        instance_path.push(&sanitized_name);

        if !instance_path.exists() {
            return Ok((sanitized_name, instance_path));
        }

        instance_path.pop();

        sanitized_name.truncate(base_length);
        sanitized_name.push_str(&i.to_string());
    }

    bail!("unable to sanitize instance name")
}

That’s quite a function! I know. Do you see anything clearly wrong with it? Yeah, I couldn’t either at first glance, but after staring at it for a bit the issue became very clear.

The part we will focus on is the section where we take the instance name, slice it to only 28 characters, and reserve the last 3 (for a total of 31 max) for the appended discriminator number.

let name = &name[0..usize::min(name.len(), 28)];

This takes “name”, which is a string slice, and is meant to truncate it to 28 “characters”, or keep it as is if it’s shorter. But there’s a catch: what it ACTUALLY does is truncate it to 28 BYTES, not characters. Most of the time the two are the same thing, but not always.

More Context

Strings in most languages are defined as a sequence of contiguous bytes in memory. Sometimes they give you the actual value (like JavaScript), sometimes a (fat) pointer to it (like Rust), but the underlying representation is very similar.

Let’s break down how the “Cozy Cottage 𝘸𝘪𝘵𝘩 𝘴𝘢𝘶𝘤𝘦 🧂” text is stored in memory. These are the actual bytes that represent this string.

43 6f 7a 79 20 43 6f 74 74 61 67 65 20 f0 9d 98 b8 f0 9d 98 aa f0 9d 98 b5 f0 9d 98 a9 20 f0 9d 98 b4 f0 9d 98 a2 f0 9d 98 b6 f0 9d 98 a4 f0 9d 98 a6 20 f0 9f a7 82

While the text is only 25 characters long, it’s represented by 55 UTF-8 bytes. How can that be?
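Both counts are easy to check in Rust, since `len()` reports bytes while `chars().count()` reports code points. Here is a minimal sketch; the `byte_and_char_counts` helper is a hypothetical name used only for illustration.

```rust
/// Returns (number of UTF-8 bytes, number of code points) for a string.
/// Hypothetical helper, for illustration only.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let name = "Cozy Cottage 𝘸𝘪𝘵𝘩 𝘴𝘢𝘶𝘤𝘦 🧂";
    let (bytes, chars) = byte_and_char_counts(name);
    println!("{bytes} bytes, {chars} code points"); // 55 bytes, 25 code points
}
```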

In the past, strings were commonly encoded in ASCII, a 7-bit encoding that defines 128 characters, each stored in 1 byte. With the expansion of computers to more and more countries (especially Japan), there was a need to encode way more characters than the original 128. This led to the creation of a standard called Unicode, which assigns each character (or emoji) a number called a code point. That code point is then turned into a specific sequence of bytes, depending on the encoding used (UTF-8, UTF-16..).

Rust (and many other programming languages) mainly supports UTF-8 out of the box. A Rust String is always valid UTF-8, so when we store a string, its text is kept as UTF-8 bytes in an underlying vector.

While some characters (like the ASCII characters) can be represented by a single byte, some others need multiple bytes (up to 4 in UTF-8).

Below you can see a mapping of each character of the string to the corresponding byte sequence that represents it in UTF-8.
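A mapping like this can also be reproduced in code. The sketch below uses only the standard library: `char_indices()` gives the byte offset of each code point and `encode_utf8()` gives its bytes. The `utf8_mapping` helper is a hypothetical name, and the string is shortened to keep the output small.

```rust
/// Lists each code point of `s` with its byte offset and its UTF-8 bytes.
/// Hypothetical helper, for illustration only.
fn utf8_mapping(s: &str) -> Vec<(usize, char, Vec<u8>)> {
    s.char_indices()
        .map(|(offset, c)| {
            // A code point never needs more than 4 bytes in UTF-8.
            let mut buf = [0u8; 4];
            let bytes = c.encode_utf8(&mut buf).as_bytes().to_vec();
            (offset, c, bytes)
        })
        .collect()
}

fn main() {
    for (offset, c, bytes) in utf8_mapping("Cozy 𝘸𝘪𝘵𝘩 🧂") {
        println!("byte {offset:>2}: {c:?} -> {bytes:02x?}");
    }
}
```

ASCII letters show up with a single byte each, while the styled letters and the emoji take 4 bytes each, which is why the byte offsets jump by 4 once they start.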

When executing

let name = &name[0..usize::min(name.len(), 28)];

we are trying to truncate “Cozy Cottage 𝘸𝘪𝘵𝘩 𝘴𝘢𝘶𝘤𝘦 🧂” to 28 bytes. The “𝘩” character of the word “𝘸𝘪𝘵𝘩” is represented by a sequence of 4 bytes at byte offsets 25 through 28, so the cut lands in the middle of it. Cutting there would produce an incomplete and invalid UTF-8 string, and since Rust string slices must always be valid UTF-8, slicing at an index that is not a character boundary panics.
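The boundary problem can be checked without crashing, because `str::is_char_boundary` and `str::get` report a bad index instead of panicking. The sketch below also includes a hypothetical `truncate_at_char_boundary` helper that backs off to the previous boundary. It only avoids the panic and is not the fix this article ends up with.

```rust
/// Truncates `s` to at most `max_bytes` bytes, backing off to the nearest
/// earlier char boundary so the result is always valid UTF-8.
/// Hypothetical helper, for illustration only.
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let name = "Cozy Cottage 𝘸𝘪𝘵𝘩 𝘴𝘢𝘶𝘤𝘦 🧂";

    // Byte 28 falls inside the 4-byte sequence of '𝘩'.
    assert!(!name.is_char_boundary(28));
    // `get` returns None where `&name[0..28]` would panic.
    assert!(name.get(0..28).is_none());

    println!("{}", truncate_at_char_boundary(name, 28)); // Cozy Cottage 𝘸𝘪𝘵
}
```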

So instead, what we might try to do is loop over the characters of the string rather than its raw bytes, something like this.

let name = name.chars().take(28).collect::<String>();

Now, while this might work with this specific input, it’s still not what we’re looking for. That’s because here we are iterating over code points, but some characters are made up of more than one code point!
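To see where counting code points goes wrong, here is a small sketch with an input chosen for illustration: “café” written with a decomposed “é”. The `take_code_points` helper is a hypothetical wrapper around the `chars().take(n)` approach.

```rust
/// Keeps the first `n` code points of `s`, like `chars().take(n)` above.
/// Hypothetical helper, for illustration only.
fn take_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // "café" where the é is 'e' followed by U+0301 (combining acute accent).
    let name = "cafe\u{301}";
    assert_eq!(name.chars().count(), 5);

    // Cutting after 4 code points keeps the 'e' but silently drops its accent.
    let cut = take_code_points(name, 4);
    assert_eq!(cut, "cafe");
    println!("{name} -> {cut}");
}
```

Nothing panics here, but the visible text changes: the user typed four characters and the last one comes back different.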

Let’s talk some terminology

A Character is an overloaded term that can mean basically anything.
A code point is a specific number that is given meaning by the Unicode standard. In our case each byte sequence would have a corresponding code point.
A code unit is a part of an encoded code point, depending on the encoding. For example in our case the salt emoji (🧂) is made up of 4 code units of 1 byte each (f0 9f a7 82) in UTF-8, but in UTF-16 it would be 2 code units of 2 bytes each.
A grapheme is a sequence of one or more code points displayed as a single graphical unit recognised as a single element.
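The difference between a code point and its code units can be inspected with the standard library alone. This sketch uses the salt emoji from the example; `code_units` is a hypothetical helper name.

```rust
/// Returns (code point, UTF-8 code units, UTF-16 code units) for a char.
/// Hypothetical helper, for illustration only.
fn code_units(c: char) -> (u32, Vec<u8>, Vec<u16>) {
    let mut utf8 = [0u8; 4];
    let mut utf16 = [0u16; 2];
    (
        c as u32,
        c.encode_utf8(&mut utf8).as_bytes().to_vec(),
        c.encode_utf16(&mut utf16).to_vec(),
    )
}

fn main() {
    let (code_point, utf8, utf16) = code_units('🧂');
    println!("U+{code_point:X}"); // U+1F9C2
    println!("{utf8:02x?}"); // [f0, 9f, a7, 82]
    println!("{utf16:04x?}"); // [d83e, ddc2]
}
```

One code point, 4 code units in UTF-8, 2 code units in UTF-16. Graphemes are not covered by the standard library at all.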

For example the small Latin e with an acute accent (é) can be represented, in its decomposed form, as U+0065(e | 1 byte | [101]) + U+0301(◌́ | 2 bytes | [204, 129]).

let s = "e\u{301}"; // "é" in decomposed form: 'e' + combining acute accent

s.as_bytes(); // [101, 204, 129] | 3 bytes, as we just saw
s.as_bytes()[0] as u32; // 101
s.as_bytes()[1] as u32; // 204

s.chars(); // iterator yielding 'e', '\u{301}'
s.chars().next(); // Some('e')
s.chars().nth(1); // Some('\u{301}')

Higher level languages like JavaScript won’t save you from having to deal with these issues.

const s = "e\u0301"; // "é" in decomposed form

new TextEncoder().encode(s) // [101, 204, 129] | 3 bytes, as we just saw

new TextEncoder().encode(s)[0] // 101
new TextEncoder().encode(s)[1] // 204
s[0] // 'e'
s[1] // '\u{301}'
s.codePointAt(0) // 101

Additional complexity — Unicode normalisation

If you thought this wasn’t hard enough, some characters have multiple equivalent Unicode representations, and different file systems store them in different ways. Converting text to one consistent representation is known as Unicode normalisation.

Some letters, like the “é” we just discussed, can be represented in either precomposed form (as produced by NFC, Normalisation Form C, canonical composition) or decomposed form (as produced by NFD, Normalisation Form D, canonical decomposition).

While the resulting visual character is the same, the code point composition and the code units are different, affecting not only storage space but also processing speed in some cases. Depending on the input method or system, your strings might be a mix of precomposed and decomposed characters. Let’s see how different filesystems handle this.
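The two forms of “é” really are different strings as far as Rust is concerned. The sketch below only compares them, since the standard library does not normalise strings; `describe` is a hypothetical helper name.

```rust
/// Returns the code points and the UTF-8 byte length of a string.
/// Hypothetical helper, for illustration only.
fn describe(s: &str) -> (Vec<u32>, usize) {
    (s.chars().map(|c| c as u32).collect(), s.len())
}

fn main() {
    let precomposed = "\u{e9}"; // é as a single code point
    let decomposed = "e\u{301}"; // e + combining acute accent

    // Same visual character, different code points and byte lengths.
    assert_eq!(describe(precomposed), (vec![0xe9], 2));
    assert_eq!(describe(decomposed), (vec![0x65, 0x301], 3));

    // String comparison works on bytes, so the two are not equal.
    assert_ne!(precomposed, decomposed);
}
```

Any check such as “does a folder with this name already exist” that compares raw strings can therefore treat the two spellings as different names.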

NTFS Filesystem

The NTFS filesystem is fairly permissive with Unicode characters in filenames, but it does have some restrictions:

Prohibited characters: ‘\’, ‘/’, ‘?’, ‘*’, and ‘.’ at the end of filenames
Windows-specific restrictions: ‘<’, ‘>’, ‘:’, ‘|’, ‘"’, ASCII characters 0–31
Reserved names: ‘CON’, ‘PRN’, ‘AUX’, ‘NUL’, ‘COM1’-‘COM9’, ‘LPT1’-‘LPT9’

Unicode normalisation in NTFS varies depending on how files are created:

Creating a file via the Windows Explorer interface

When typing accented characters, the OS stores them in precomposed form.

Creating a file via the command line

File creation via command line or batch file also uses precomposed form.

Programmatic Creation (e.g., Rust, JavaScript)

The OS stores the exact Unicode sequence provided, without normalisation. This means files with different Unicode representations of the same visual character can coexist.

Mac OS HFS+ Filesystem

HFS+ is more restrictive in some ways, but more consistent in its Unicode handling:

The only explicitly prohibited character is ‘:’
‘/’ is reserved as a path separator

Key differences in Unicode handling:

HFS+ stores filenames in decomposed Unicode form (a modified version of NFD).
This decomposition happens at a low system level and can’t be bypassed.
Example: ‘é’ is stored as two separate code points (‘e’ + combining acute accent).
If you try to create a file with a name that’s visually identical but in a different Unicode form (e.g., precomposed), the system will treat it as the same file.

Note: Some characters, like the Angstrom sign, remain composed despite the general NFD rules.

Ext4 filesystem

Ext4 takes a hands-off approach to Unicode normalisation:

No automatic conversion or normalisation occurs.
Filenames are stored exactly as they’re input.
This allows files with different Unicode representations of the same visual character to coexist.

Interoperability

These differences can lead to issues when moving files between systems:

Mac to Windows: Filenames like ‘nul’ or ‘file/’ created on Mac may be unusable on Windows.
Windows/Linux to Mac: Files with precomposed characters will be converted to decomposed form on Mac.
Accessing precomposed filenames on Mac: The system will return the decomposed version if it exists.

This also depends on the transfer method used. For example, with a FAT32 USB stick, some intermediary conversion might take place independently of the operating system.

Conclusion

The solution is to use an external library to get the graphemes of the string, rather than its code points or bytes, so our solution will be:

use unicode_segmentation::UnicodeSegmentation;
let name = name.graphemes(true).take(28).collect::<String>();

Most languages don’t have graphemes built-in, so you’ll most likely need to resort to an external library.

I hope this has been useful to you and will help you the next time you’ll have to deal with Unicode / UTF-8 strings!