string, bytes, runes and characters - Go

0. Make it eassy#

This is all this post is going to talk, if you know this, you don’t need to go through all this post.

a string literal is a slice of bytes:

func main() {
	str := "汉"
	for i:=0; i<len(str); i++{
		fmt.Printf("%x ", str[i])
	}
}
---------------------------
e6 b1 89

a string literal consists of one or more rune values:

func main() {
	str := "abc汉"
	for _, s := range str {
		fmt.Printf("%x ", s)
	}
	fmt.Println()
	for _, s := range str {
		fmt.Printf("%c ", s)
	}
}
--------------------------
61 62 63 6c49 
a b c 汉

Note that the ways of iterating the string literal in the codes above are different.

1. string literals are encoded in UTF-8#

str := "汉"
fmt.Printf("length of str: %d", len(str))

Output:

length of str: 3

Why the output is 3? It’s all about the encoding.

Character 汉 is a chinese character which occupies 3 bytes in utf-8 encoding format, and its code point is 6C49. Code point is a concept defined in Unicode. If you are not familiar with these you can check my another post: https://davidzhu.xyz/post/cs-basics/001-encoding/

In Go, string literals always contain UTF-8 text as long as they have no byte-level escapes. Therefore, the length of str is 3 bytes. This is the string in Go runtime sourcode which will help you understand string in go:

// src/runtime/string.go
type stringStruct struct {
  str unsafe.Pointer
  len int
}

Note that: 'x' represents a single character (called a rune), whereas "x" represents a string literal containing the character x. This is same as in c++ and c.

Size is not length, the size of a stirng value is always 2 words, learn more: Value Variable and Types - Go - David’s Blog

2. string is read-only slice of bytes#

In Go, a string is in effect a read-only slice of bytes. We have disscussed slice in previous post.

func main() {
    str := "汉"
    for i := 0; i < len(str); i++ {
       fmt.Printf("%x ", str[i])
    }
}

Output:

e6 b1 89

As we saw, indexing a string yields its bytes, not its characters: a string is just a bunch of bytes.

Here is another example that I want to talk:

func main() {
	str := "汉"
	fmt.Printf("%x ", str)
	fmt.Println()
	for i := 0; i < len(str); i++ {
		fmt.Printf("%x ", str[i])
	}
}

Output:

6c49 
e6 b1 89

That character 汉 has Unicode value U+6C49 , this is just its Unicode code point, if we want wrtie this string into a file, we have to use a encoding scheme, gbk, utf-8 and other encoding scheme. After 汉 is encoded by utf-8, the value represents it is e6 b1 89 Note that the value that written in the disk is e6 b1 89 which stands for 汉 in utf-8. Learn more: https://davidzhu.xyz/post/cs-basics/001-encoding/

3. string is immutable#

func EchoCommandArgs() {
	var s string
	// The variable os.Args is a slice of strings.
	// range produces a pair of values: the index and the value of the element at that index.
	// we ignore index with _ marker here
	for _, arg := range os.Args[1:]{
		// s += ... will make a new string everytime, because string is immutable
		// The += statement makes a new string by concatenating the old string,
		// a space character, and the next argument, then assigns the new string to s.
		// The old contents of s are no longer in use, so they will be garbage-collected in due course.
		// If the amount of data involved is large, this could be costly.
		// A simpler and more efficient solution would be to use the Join function from the strings package
		s += arg
		s += " "
	}
	fmt.Println(s)
	fmt.Println(os.Args[1:])
	fmt.Println(strings.Join(os.Args[1:], " "))
}
---------------------------------------
$ go run main.go hello hi hi
hello hi hi 
[hello hi hi]
hello hi hi

4. `byte` & `rune`#

A byte in Go is an unsigned 8-bit integer whose type is actually uint8, which can represent an ASCII character.

Go uses rune, which has type int32, to deal with multibyte characters. rune occupies 32bit (4 bytes) and is meant to represent a Unicode CodePoint which is 4 bytes. Source: https://stackoverflow.com/a/19325804/16317008

汉 is a multibyte character. Here should note that we say “rune is used to deal with multibyte characters” doesn’t mean it cannot represent singlebyte characters (ASCII), it’s just a int32 type, not something magic.

func main() {
	str := "abc汉"
	for _, c := range str {
		fmt.Printf("%T ", c)
	}
	fmt.Println()
	for _, c := range str {
		fmt.Printf("%x ", c)
	}
	fmt.Println()
	for i := 0; i < len(str); i++ {
		fmt.Printf("%x ", str[i])
	}
}

Output:

int32 int32 int32 int32 
61 62 63 6c49 
61 62 63 e6 b1 89

As we see here from the output line-3, a string literal is just a slice of bytes, but the first two line prove that we can think a string literal is just a slice of rune which is int32.

Here is another example may help you to understand the string literal and rune:

func isValid(s string) bool {
	for _, c := range s {
    // error: Invalid operation: c == "{" (mismatched types int32 and string)
		if c == "{":
			...
    // no error
    // `'x'` represents a single character (called a rune)
    // c is a rune here
    if c == '{' {
			
		}
	}
	return false
}

5. Go source code is UTF-8#

Source code in Go is defined to be UTF-8 text; no other representation is allowed. That implies that when, in the source code, we write the text

`⌘`

the text editor used to create the program places the UTF-8 encoding of the symbol ⌘ into the source text. When we print out the hexadecimal bytes, we’re just dumping the data the editor placed in the file.

Recall the 汉 example we talked above, the real thing written in the disk is not its Unicode code point but the data encoded by uft-8 which is 3 bytes.

In short, Go source code is UTF-8, so the source code for the string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.

Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. As we showed in the previous section, string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes.

To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.

6. `[]byte` vs `[]rune`#

A slice of byte just a slice of uint8, just a list of small numbers (0~127). We usually use a []byte slice when we want store ASCII characters which are singlebye characters, if you want store multibyte characters, you should use []rune or the string literals.

7. `[]byte`, `[]rune` and string conversion will makes copy#

func main() {
  // this is a slice not an array
  // you can replace []byte with []rune
	data := []byte{102, 97, 108, 99, 111, 110}
  // convert byte slice to string
	fmt.Println(string(data))
  // string to byte slice
  fmt.Println([]byte("falcon"))
}
----------------------------------------
[102 97 108 99 111 110]
falcon
[102 97 108 99 111 110]

The compiler will allocate a new piece of memory and copy the data over when you convet string to bytes or bytes to string. The reason is that []byte is not immutable. You can change the bytes within a slice. You can grow the slice by appending bytes to the end. But a string is immutable. You cannot change it and you cannot grow it without creating a whole new string. If a string and a []byte were backed by the same memory, then the string would change when the []byte was changed, and that might not be what you expect.

So, each time you do []byte(“fred”) or string([]byte{0x40, 0x040}), there’s an allocation and a copy. This is done by https://golang.org/pkg/runtime/?m=all#stringtoslicebyte and https://golang.org/pkg/runtime/?m=all#slicebytetostring. And this is normally the right thing to do and normally has no significant consequences.

References:

string, bytes, runes and characters - Go

0. Make it eassy#

1. string literals are encoded in UTF-8#

2. string is read-only slice of bytes#

3. string is immutable#

4. byte & rune#

5. Go source code is UTF-8#

6. []byte vs []rune#

7. []byte, []rune and string conversion will makes copy#

4. `byte` & `rune`#

6. `[]byte` vs `[]rune`#

7. `[]byte`, `[]rune` and string conversion will makes copy#