David Lee

Summary

The provided content discusses the differences in string handling between Go and Rust, emphasizing their distinct approaches to string indexing, Unicode and UTF-8 encoding, and the implications for text processing, file I/O, network communication, web development, database storage, and cryptography.

Abstract

The article delves into the intricacies of string representation and manipulation in Go and Rust, highlighting the contrast between Go's byte-oriented string indexing and Rust's Unicode scalar value approach. It explains that in Go, strings are immutable sequences of bytes, which means that indexing a string returns a byte rather than a character, potentially leading to unexpected results when dealing with multi-byte UTF-8 characters. In contrast, Rust disallows direct indexing of strings to prevent invalid UTF-8 sequences and instead requires slicing or iterating over characters (Unicode scalar values). The article further explores the use of UTF-8 encoding in both languages, the need for encoding and decoding strings in various use cases such as file I/O, network communication, web development, database storage, and cryptography, and provides practical examples to illustrate these concepts.

Opinions

  • The author suggests that programmers with a background in Python or JavaScript might be surprised by Go's string indexing behavior, which returns bytes instead of characters.
  • The article conveys that Go's design choice to treat strings as byte slices is beneficial for handling binary data and UTF-8 encoded strings efficiently.
  • It is implied that Rust's string handling is more stringent and safer, as it prevents operations that could result in invalid UTF-8 by not allowing direct string indexing.
  • The author posits that the abstraction of encoding and decoding details in Python and JavaScript can be advantageous for most common use cases, while languages like Go and Rust offer more control over the underlying bytes.
  • The article suggests that while both Go and Rust abstract away some aspects of string encoding and decoding, especially in the context of database interactions, developers need to be aware of these processes for correct and secure handling of text data.
  • The author emphasizes the importance of understanding Unicode and UTF-8 encoding, particularly when working with text in multiple languages or when encoding and decoding strings for purposes such as cryptography.

String Secret in Rust and Go

What do you think the output of the code below is in Go:

s := "hello"
fmt.Println(s[0]) 

If your guess is h, you might need to stay alert!

Coming from a Python or JavaScript background, people might not be aware that in Go, a string is a read-only slice of bytes.

s := "hello"
fmt.Println(s[0])      // Prints "104", the byte value of 'h'

When you index a string, you’re accessing the byte at that index, not a character or a string. This is because strings in Go are designed to handle not only text, but also binary data.

This design choice is particularly useful when dealing with UTF-8 encoded strings. UTF-8 is a variable-width encoding, meaning that characters can be represented with different numbers of bytes. By treating strings as byte slices, Go allows you to work with UTF-8 strings in a memory-efficient way.
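
For instance, here is a small sketch (it assumes the unicode/utf8 package is imported):

s := "café"                            // 'é' takes two bytes in UTF-8
fmt.Println(len(s))                    // Prints "5", the number of bytes
fmt.Println(utf8.RuneCountInString(s)) // Prints "4", the number of characters (runes)

len counts bytes, while utf8.RuneCountInString counts runes, so the two numbers diverge as soon as the string contains non-ASCII characters.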

However, this also means that if you want to get a substring or a character from a string, you need to use a string conversion or string slicing, not indexing.

s := "hello"
fmt.Println(s[0])      // Prints "104", the byte value of 'h'
fmt.Println(string(s[0])) // Prints "h", the string representation of the byte
fmt.Println(s[0:1])    // Also prints "h", the first character of the string

s[0] gives you the byte value of the first character, string(s[0]) gives you the string representation of the byte, and s[0:1] gives you the first character of the string.

People from a Python or JavaScript background know:

Python treats strings more like JavaScript does in this context. In Python and JavaScript, when you index a string, you get a one-character string at that index, not a byte.

colorset = "10 red"
print(colorset[0])  # Prints "1"

In Python, colorset[0] gives you the string "1", not a byte. This is because Python strings are sequences of characters, and indexing a string gives you a one-character string.

What is Unicode and UTF-8 and why use them?

Main reason: computers don’t understand text, only numbers.

Unicode is a standard that assigns a unique identifier to every character used in written languages, allowing computers to use or manipulate text from any linguistic tradition.

UTF-8 (Unicode Transformation Format — 8-bit) is a method for encoding Unicode characters using 8-bit sequences. It’s a variable-width encoding, meaning that not all characters use the same number of bytes. ASCII characters (the basic Latin alphabet, digits, and punctuation) use one byte, but other characters can use up to four bytes.

The advantage of UTF-8 is that it’s fully compatible with ASCII, the most basic character encoding standard, which means that any system that understands ASCII can interpret the 7-bit ASCII subset of UTF-8. It’s also efficient for languages that use Latin script, because they mostly use one byte per character, just like ASCII.

Go uses UTF-8 for string encoding because it’s a flexible and efficient way to represent text in any language.

For example, consider the string “Hello”. In ASCII or UTF-8, each character is represented by a number: ‘H’ is 72, ‘e’ is 101, ‘l’ is 108, ‘l’ is 108, and ‘o’ is 111. These numbers can then be represented in binary, which is what the computer understands.

However, ASCII only supports 128 characters, which is enough for English but not for other languages. Unicode supports over a million characters, covering virtually all characters used in every written language. UTF-8 is a way to encode these Unicode characters in a space-efficient manner.

Here’s how you might represent “Hello” and “Привет” (which means “Hello” in Russian) in Go:

english := "Hello"
russian := "Привет"

fmt.Println([]byte(english)) // Prints: [72 101 108 108 111]
fmt.Println([]byte(russian)) // Prints: [208 159 209 128 208 184 208 178 208 181 209 130]

[]byte(english) gives you the byte representation of "Hello" in UTF-8, and []byte(russian) gives you the byte representation of "Привет". As you can see, the Russian text requires more bytes per character because it uses characters not found in the ASCII set.

How to access a specific character in other languages like Russian?

You just need to convert the string to a slice of runes first:

russian := "Привет"
runes := []rune(russian)
fmt.Println(string(runes[0])) // Prints "П"

A rune in Go represents a Unicode code point. When you convert a string to a slice of runes, you get a slice where each element is a rune that represents a character in the string. This allows you to handle strings that contain non-ASCII characters correctly.

Another way:

russian := "Привет"
for index, runeValue := range russian {
    fmt.Printf("Character at index %d is %c\n", index, runeValue)
}

In Go, we can use the range keyword to iterate over a string by Unicode code points (runes), which allows you to handle non-ASCII characters correctly.
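
Note that the index produced by range is the byte offset of each rune, not a character count. Since every Cyrillic letter takes two bytes in UTF-8, the loop above prints:

Character at index 0 is П
Character at index 2 is р
Character at index 4 is и
Character at index 6 is в
Character at index 8 is е
Character at index 10 is т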

Why don’t Python and JavaScript encode & decode strings?

Python and JavaScript do encode strings, but they abstract away the details of encoding from the programmer for most common use cases. Both languages use Unicode for string encoding, which allows them to represent a wide range of characters from different languages.

In Python and JavaScript, strings are sequences of Unicode characters. When you index a string, you get a one-character string at that index. They handle the encoding and decoding of strings into bytes when necessary, such as when reading or writing to a file or a network stream.

The safety of a language’s string handling doesn’t necessarily depend on whether it exposes details of string encoding to the programmer. Both Python and JavaScript have mechanisms to handle invalid Unicode sequences when encoding and decoding strings. The key difference is that languages like Go and Rust allow more direct control over the bytes that make up a string, which can be an advantage in certain use cases.

Rust

let colorset = "10 red";
println!("{}", colorset[0]); 

What do you think the output of the above code is in Rust?

If you try to compile this code, you’ll get an error message like this:

error[E0277]: the type `str` cannot be indexed by `{integer}`

In Rust, you can’t index into a String or a string slice (&str) directly with a single number due to the complexities of UTF-8 encoding.

Why does string indexing get a byte in Go but a `compile error` in Rust?

Go is designed with simplicity and efficiency in mind, and working with bytes directly can be more efficient in some cases. However, this means you need to be careful when working with strings that contain non-ASCII characters, as these characters can be represented by multiple bytes in UTF-8.

Rust doesn’t allow direct string indexing to prevent creating invalid strings by splitting a character’s bytes. This aligns with Rust’s overall focus on safety.

This is because a single character in a UTF-8 encoded string can span multiple bytes, and indexing could split a character, leading to invalid UTF-8. To get a substring in Rust, you need to use slicing with a range.

To add more, in Rust, strings are sequences of Unicode scalar values (each of which is encoded as a sequence of bytes using UTF-8). Rust doesn’t allow you to index a string directly because a single Unicode character can be represented by multiple bytes. If you were allowed to index a string like in Go, you could accidentally index into the middle of a character, which would result in invalid Unicode.

How to fix the above code in Rust?

Use slicing with a range to get a substring:

let colorset = "10 red";
println!("{}", &colorset[0..1]); // Prints "1"

&colorset[0..1] gives you a string slice containing the first character of colorset.

If you need to access individual characters in a string, you can use the chars method to get an iterator over the characters:

let colorset = "10 red";
for ch in colorset.chars() {
    println!("{}", ch);
}

Output:

1
0

r
e
d

A little extra knowledge:

In Rust, the & operator is used to create a reference to a value. When you slice a string like colorset[0..1], the result has the type str, which is an unsized type that cannot be passed around by value. Taking a reference turns it into a &str string slice (a pointer plus a length), which is what println! can work with. That's why you write &colorset[0..1] instead of colorset[0..1].

In Go, when you index a string like colorset[0], you're directly accessing the byte at that index, not a reference to it. This is because Go's string type is an immutable sequence of bytes.

This is different from languages like Rust, where you often work with references to values rather than the values themselves. In Rust, this is done to prevent unnecessary copying of data and to control how and when data is mutated. Go, on the other hand, is designed with simplicity in mind, and often opts for copying data rather than working with references.

[Figure: a greatly simplified view of communication protocol layers]

Use Cases of Encoding and Decoding Strings:

Let’s dive into some use cases and compare how Go and Rust handle string encoding and decoding (in each pair of snippets, the Go code appears above and the Rust code below).

File I/O:

When you read text from a file or write text to a file, you need to decode or encode the text as a sequence of bytes.
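
A minimal Go sketch using the standard os package (the file name hello.txt is only illustrative):

package main

import (
    "fmt"
    "os"
)

func main() {
    // Encode the string as UTF-8 bytes and write them to a file.
    if err := os.WriteFile("hello.txt", []byte("Hello, World!"), 0644); err != nil {
        panic(err)
    }

    // Read the raw bytes back and decode them into a string.
    raw, err := os.ReadFile("hello.txt")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(raw)) // Prints "Hello, World!"
}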

Network Communication:

When sending and receiving data over a network, the data is transmitted as bytes. If the data includes text, it needs to be encoded to bytes before transmission and decoded back into text upon receipt.

In both examples, a server is started that listens for incoming connections and prints any messages it receives. Then a client connects to the server and sends the message “Hello, World!”. The message is encoded as bytes for transmission and then decoded back into text upon receipt.

In the context of network programming, a buffer is often used when reading data from a network connection because the amount of data that’s available to read can vary. By reading data into a buffer, you can handle whatever amount of data is available without needing to know in advance how much data to expect.

It’s worth noting that in Go, we used the bufio.NewReader(conn).ReadString('\n') method to read the incoming data. This method reads data until it encounters a newline character ('\n'). It automatically handles buffering the incoming data, so we don't need to manually create a buffer.
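
A rough Go sketch of the server side described above (the port number and error handling are assumptions, not the original code):

package main

import (
    "bufio"
    "fmt"
    "net"
)

func main() {
    ln, err := net.Listen("tcp", ":8080") // port chosen for illustration
    if err != nil {
        panic(err)
    }
    defer ln.Close()

    for {
        conn, err := ln.Accept()
        if err != nil {
            continue
        }
        // Read bytes until the newline and decode them into a Go string.
        msg, err := bufio.NewReader(conn).ReadString('\n')
        if err == nil {
            fmt.Print(msg) // e.g. "Hello, World!" sent by a client
        }
        conn.Close()
    }
}

A matching client could dial the server with net.Dial and encode its message for transmission with conn.Write([]byte("Hello, World!\n")).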

In Rust, let mut buffer = [0; 512]; creates a mutable array named buffer with 512 elements (512 bytes in this case), all initialized to 0. This array is used as a buffer for reading data from the network.

When you call stream.read(&mut buffer), the read method fills the buffer with the bytes that it reads from the network. The number of bytes it reads is determined by the size of the buffer, which is why you need to specify a size when you create it.

Both approaches are valid and have their uses depending on the situation. The Go approach can be simpler if you’re reading data that’s delimited in some way (like lines of text), while the Rust approach gives you more control over the reading process.

Web Dev:

In HTML and URL encoding, certain characters are encoded to ensure they are transmitted safely over the internet. For example, spaces in URLs are often encoded as %20.

A URL with a space in the query string is parsed and then encoded. The space in the query string is encoded as %20. The Go version uses the net/url package, and the Rust version uses the url crate.
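
Here is a small Go sketch with net/url (note that Go's query encoder escapes a space as +, while PathEscape produces %20):

package main

import (
    "fmt"
    "net/url"
)

func main() {
    base, err := url.Parse("https://example.com/search")
    if err != nil {
        panic(err)
    }

    q := base.Query()
    q.Set("q", "hello world")
    base.RawQuery = q.Encode() // the space is escaped in the query string

    fmt.Println(base.String())                 // https://example.com/search?q=hello+world
    fmt.Println(url.PathEscape("hello world")) // hello%20world
}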

Why is there a space in the URL?

The URL “https://example.com/search?q=helloworld” doesn’t contain any unsafe characters, so it doesn’t need to be encoded. However, parsing the URL can still be useful to validate its structure, extract different parts (like the scheme, host, path, query, etc.), or modify it in some way.

In practice, URLs often contain data that can include unsafe characters. For example, a search query might include spaces or other special characters. That’s why I used “https://example.com/search?q=hello world” as an example, to illustrate how unsafe characters in a URL can be encoded.

Even if a URL doesn’t currently contain any unsafe characters, it’s still a good practice to parse and encode URLs in your code. This ensures that your code will still work correctly if the URL data changes in the future to include unsafe characters.

Database Storage:

When storing and retrieving text from a database, the text often needs to be encoded as bytes.

Note that the encoding and decoding of text as bytes is handled automatically by the database libraries. This is a common feature of most database libraries in high-level languages like Go and Rust.

In Go, when you execute the db.Exec("INSERT INTO test (content) VALUES (?)", "Hello, World!") statement, the database/sql package automatically encodes the "Hello, World!" string as bytes before sending it to the SQLite database.

When you later retrieve the data with db.Query("SELECT content FROM test") and rows.Scan(&content), the database/sql package automatically decodes the bytes from the SQLite database back into a Go string.
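
As a sketch, the Go side might look like this (assuming the mattn/go-sqlite3 driver and an in-memory database):

package main

import (
    "database/sql"
    "fmt"

    _ "github.com/mattn/go-sqlite3" // SQLite driver, assumed for illustration
)

func main() {
    db, err := sql.Open("sqlite3", ":memory:")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    if _, err := db.Exec("CREATE TABLE test (content TEXT)"); err != nil {
        panic(err)
    }

    // The string is encoded to bytes by the driver before it reaches SQLite.
    if _, err := db.Exec("INSERT INTO test (content) VALUES (?)", "Hello, World!"); err != nil {
        panic(err)
    }

    rows, err := db.Query("SELECT content FROM test")
    if err != nil {
        panic(err)
    }
    defer rows.Close()

    for rows.Next() {
        var content string
        if err := rows.Scan(&content); err != nil { // bytes decoded back into a Go string
            panic(err)
        }
        fmt.Println(content) // Prints "Hello, World!"
    }
}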

In Rust, the rusqlite crate provides similar automatic encoding and decoding. The "Hello, World!" string is automatically encoded as bytes when you execute the conn.execute("INSERT INTO test (content) VALUES (?1)", params!["Hello, World!"]) statement. It's then automatically decoded back into a Rust String when you retrieve it with stmt.query_map(params![], |row| { Ok(row.get(0)?) }).

So, while the encoding and decoding is happening, it’s abstracted away by the database libraries.

Text Processing:

When working with text in different languages, encoding helps to handle special characters and symbols that are not part of the ASCII character set. And this has been one of the main focuses of this article.

In Go and Rust, strings are encoded as UTF-8 by default.

In the code above: text := "こんにちは, 世界!" in Go or let text = "こんにちは, 世界!"; in Rust, the string is automatically encoded as UTF-8.

When you print the string with fmt.Println(text) in Go or println!("{}", text); in Rust, the UTF-8 encoded string is automatically decoded back into characters to display it on the screen.

It’s abstracted away by the language’s standard library.
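
Here is a short Go sketch that makes the implicit encode/decode steps explicit:

package main

import "fmt"

func main() {
    text := "こんにちは, 世界!" // the string literal is stored as UTF-8 bytes

    encoded := []byte(text)    // the raw UTF-8 encoding
    decoded := string(encoded) // decoding the bytes back into a string

    fmt.Println(len(encoded)) // more bytes than characters, because of multi-byte runes
    fmt.Println(decoded)      // Prints "こんにちは, 世界!"
}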

Cryptography:

Encoding is used in various cryptographic techniques to transform data into a format that is unreadable without a decryption key.

Let’s use AES encryption as an example:

  • In Go, using the crypto/aes and crypto/cipher packages
  • In Rust, using the aes, block-modes, and base64 crates

The plaintext “Hello, World!” is encrypted using AES encryption and the ciphertext is then encoded as a base64 string. Base64 encoding is commonly used in cryptography to encode binary data, especially when that data needs to be stored and transferred over media that are designed to deal with text.

A little knowledge:

The variable iv stands for "Initialization Vector". In cryptography, an initialization vector (IV) is an arbitrary number that can be used along with a secret key for data encryption. This number, also called a nonce, is employed only one time in any session.

The use of an IV prevents repetition during the encryption process, making it much harder for attackers who use dictionary attacks to decrypt the exchanged encrypted message by comparing it with pre-computed encrypted ‘words’. Essentially, an IV is an added input to a cryptographic primitive that makes the output more random and less predictable, and thus more secure.

In the context of the Go code, iv := ciphertext[:aes.BlockSize] is creating an initialization vector of the same size as the AES block size. This IV is then used in the cipher.NewCFBEncrypter(block, iv) function to create a new stream cipher.
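
A condensed Go sketch of this pattern (the key below is a placeholder for illustration; a real key must be generated randomly and stored securely):

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "encoding/base64"
    "fmt"
    "io"
)

func main() {
    key := []byte("example key 1234") // 16 bytes -> AES-128
    plaintext := []byte("Hello, World!")

    block, err := aes.NewCipher(key)
    if err != nil {
        panic(err)
    }

    // The ciphertext carries the IV in its first aes.BlockSize bytes.
    ciphertext := make([]byte, aes.BlockSize+len(plaintext))
    iv := ciphertext[:aes.BlockSize]
    if _, err := io.ReadFull(rand.Reader, iv); err != nil {
        panic(err)
    }

    stream := cipher.NewCFBEncrypter(block, iv)
    stream.XORKeyStream(ciphertext[aes.BlockSize:], plaintext)

    // Base64-encode the binary ciphertext so it can be stored or sent as text.
    fmt.Println(base64.StdEncoding.EncodeToString(ciphertext))
}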

Other relevant articles on Rust and Go:

Rust or Go (heated debate version)

Rust or Go

