As a follow-on from my post on floating point values and endian swapping, I thought I’d write an article on strings. String encoding is fairly complicated due to the fact that we’re now living in a large global community – ASCII simply won’t cut it any more. On the positive side, if you only need ASCII you can send it across the network as-is. Since it’s only a single byte, there is no worry about endianness.
The problem is when dealing with string encoding that can go over a single byte, so-called multi-byte encoding. This is very likely if you need to support languages other than English. Rather than worry about endianness, BOMs, and other character encoding headaches; may I humbly suggest you try UTF-8. UTF-8 is completely backwards-compatible with ASCII, yet supports the entire unicode character set. It uses the minimum number of bytes for each character (good for saving bandwidth) and is inherently immune to endian issues (for for your sanity). In C# encoding to UTF-8 couldn’t be simpler:
The only issue with UTF-8 is that you can’t just say that the string is X number of bytes long based on the number of characters. Characters can require anywhere from 1 byte to 6 bytes. The simplest solution to this problem is to encode the string into a byte array, then prefix with an int that contains the length of the entire string in bytes. When reading the string back in, you can read in the first int then read however many bytes the int specifies.