Wednesday, June 6, 2012

Refactoring C to C++ Part 2 - Strings, Strings, and More Strings

The previous entry in this series was a general info dump on a converted class. This time we will look at a more general topic: string usage in C++.

One large improvement in C++ coding over C is in the area of strings. In C, a string is just a raw pointer to memory that should hold a NUL-terminated sequence of valid characters. In practice, there are many ways for problems with C strings to creep in:

  • the final zero-byte null terminator might be missed during creation.
  • some common library functions will ensure null termination, while others do not.
  • to determine the length of a string, it has to be walked from the start until the terminator is found.
  • resizing and appending to strings can be complex multistage operations with many potential failure points.
  • resizing a string can move it in memory, which invalidates any existing pointers to it.
  • tracking different character encodings can be difficult.
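
To make a couple of those points concrete, here is a minimal sketch that puts the manual bookkeeping of a C string next to the std::string equivalent. The function names copy_c_style and join_cpp_style are made up for this illustration and do not come from any real code base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// C style: the caller supplies the buffer, tracks its size, and must
// terminate it by hand. strncpy() does not write a terminator when the
// source is at least as long as the count it is given.
void copy_c_style(char *dest, std::size_t dest_size, const char *src)
{
    assert(dest != nullptr && dest_size > 0);
    std::strncpy(dest, src, dest_size - 1);
    dest[dest_size - 1] = '\0'; // easy to forget; the text is still silently truncated
}

// C++ style: std::string owns its storage, knows its own length, and
// grows as needed when appending.
std::string join_cpp_style(std::string const &first, std::string const &second)
{
    std::string result = first;
    result += ' ';
    result += second;
    return result;
}
```

With a six-byte buffer, copy_c_style keeps only the first five characters of a longer source, and nothing tells the caller that truncation happened. join_cpp_style has no buffer size to get wrong in the first place.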

In C++, strings are generally represented by the standard class std::string. However, that still does not address the issue of encodings. What an individual byte or sequence of bytes means can depend on many factors. Modern programs have to deal with multiple encodings... even if their developers do not always realize it.

With GTK+ programs there are three main encodings to be aware of: locale encoding, filesystem encoding, and internal encoding. The internal encoding is used for UI widgets and most internal GTK+ calls, and it is always UTF-8. The locale encoding can vary at runtime; although it is commonly UTF-8 as well, it can be any other encoding. The filesystem encoding is a separate setting again and is used for paths. It can vary greatly on systems that have been upgraded over time.

I'll cover encodings a bit more at a different time, but in the context of GTK+ and C++, the encoding the data might be in lets us choose between the two main classes for strings:

std::string
The standard class for strings in C++. Should be used when the data might be in an encoding other than UTF-8. This is the case for GTK+ and GLib APIs that operate on either locale or filesystem encodings.
Glib::ustring
A class from glibmm (the C++ binding for GLib that gtkmm builds on) that represents strings of UTF-8 data. Among other things, it manages the details of single characters that span multiple bytes in UTF-8.
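
To show why that multi-byte handling matters, here is a rough sketch using only std::string, so that it builds without glibmm. The helpers utf8_length and utf8_offset are hypothetical names invented for this post. They assume the input is already valid UTF-8 and do no validation. They only illustrate the kind of bookkeeping Glib::ustring takes care of and are not how that class is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 characters (code points) by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
std::size_t utf8_length(std::string const &utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Returns the byte offset at which the character with the given index
// starts, or the size of the string when the index is one past the end.
std::size_t utf8_offset(std::string const &utf8, std::size_t char_index)
{
    assert(char_index <= utf8_length(utf8)); // caller must stay within the string
    std::size_t seen = 0;
    for (std::size_t pos = 0; pos < utf8.size(); ++pos) {
        unsigned char const byte = static_cast<unsigned char>(utf8[pos]);
        if ((byte & 0xC0) != 0x80) {
            if (seen == char_index) {
                return pos;
            }
            ++seen;
        }
    }
    return utf8.size();
}
```

For the text "cafés" encoded as UTF-8, std::string::size() reports six bytes while utf8_length reports five characters, and the final "s" sits at byte offset 5 but character index 4. Mixing up those two ways of counting is a classic source of bugs when UTF-8 lives in a plain std::string.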

Thankfully we end up with some fairly simple rules for C++ programs:

  • Use a single common encoding for as much of a program as possible. For GTK+ this is UTF-8.
  • Avoid using legacy C strings such as "char *" or "gchar *".
  • Use Glib::ustring for all UTF-8 encoded strings.
  • Use std::string for strings that might be in different encodings.
  • Be very careful about string conversions, and use explicit encodings.
  • Do not mix strings and byte data.
  • Use std::vector<uint8_t> for random byte buffers.
  • For parameters passed into functions, use "Glib::ustring const &" or "std::string const &".
  • For return values, prefer functions that return "Glib::ustring" or "std::string" (note that these do not use 'const' nor references).
  • For functions that return multiple strings, take in parameters of either "Glib::ustring &" or "std::string &".
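
The parameter and return value rules can be sketched roughly like this. Every function name here (make_greeting, split_key_value, to_bytes) is made up for illustration. The sketch uses std::string throughout so that it compiles without glibmm. For text known to be UTF-8, the same signatures would use Glib::ustring instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Input parameter: const reference, so no copy is made.
// Return value: by value, with no const and no reference.
std::string make_greeting(std::string const &name)
{
    return "Hello, " + name;
}

// Multiple results: out-parameters taken by non-const reference.
// Returns false, leaving both outputs untouched, when no '=' is present.
bool split_key_value(std::string const &line, std::string &key, std::string &value)
{
    assert(&key != &value); // the two outputs must be distinct objects
    std::string::size_type const pos = line.find('=');
    if (pos == std::string::npos) {
        return false;
    }
    key = line.substr(0, pos);
    value = line.substr(pos + 1);
    return true;
}

// Raw bytes live in a byte vector rather than in a string class.
std::vector<std::uint8_t> to_bytes(std::string const &text)
{
    return std::vector<std::uint8_t>(text.begin(), text.end());
}
```

Returning by value keeps ownership simple, and compilers can usually avoid the extra copy. The out-parameter form is only needed once a function has more than one string to hand back.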

Finally we end up with a very important question: does any of this make sense? Hopefully some guidance can be quickly drawn from this information. However, if any point needs more clarification, or was missed, please speak up and let me know what to address.
