File Encoding


In countries using Latin (also called Roman) characters, simple text files are often called ASCII files (ASCII stands for American Standard Code for Information Interchange). The ASCII standard defines a set of characters represented by the numbers 0-127. These characters include the standard a-z, A-Z, and 0-9 characters, along with common punctuation (including quotation marks and apostrophes, for example, but not including "smart quotes" or "curly apostrophes"). With only 128 characters, this requires only 7 bits (2 to the 7th power is 128), but most computers store 8 bits per character in a file, which allows for 256 different characters. This is where "encodings" come in. Let's say that to those "basic" characters you want to add some typical Spanish characters like é or ñ, or perhaps Swedish, or Japanese, or Russian, or Hebrew, or Arabic, etc. The historic solution was something called "codepages." On Macs, for example, a typical "codepage" or encoding was (and still is) called MacOS Roman. In this encoding, the letter é is represented by the number 142. On Windows, a common "codepage" or encoding was (and still is) called Windows-1252 (or its close relative ISO-8859-1), sometimes called Windows Latin 1. In this encoding, the letter é is represented by the number 233. You can begin to see the problem, and we haven't even gotten to the non-Latin alphabets! If you open a file, you have to know exactly which of the many encodings was used to create it; if you don't, a character which started out as é on the Windows computer where you created the file might end up as an È when you open the file on a Macintosh. A file which started out as Cyrillic characters in Russia could be opened on a computer in the U.S., but when interpreted as Latin characters, it would be complete nonsense.
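
If you happen to have Python on your computer, here is a tiny sketch (Python isn't needed for anything in this manual; it's just a convenient way to see the numbers for yourself) showing that the same letter é gets a different number under each of these codepages:

    # The single letter é is stored as a different number under each legacy codepage.
    text = "é"
    mac_bytes = text.encode("mac_roman")   # b'\x8e' -> 142, the MacOS Roman value
    win_bytes = text.encode("cp1252")      # b'\xe9' -> 233, the Windows-1252 value
    print(mac_bytes[0], win_bytes[0])      # prints: 142 233

    # Reading the Windows byte as if it were MacOS Roman gives the wrong character:
    print(win_bytes.decode("mac_roman"))   # prints: È  (the mix-up described above)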

In recent years, particularly since the rise of the Internet, a new universal standard called Unicode has arisen to (ultimately) replace all of these. The idea is to create a single set of numbers, ranging not from 0 to 255 but over a much larger range (for our purposes, think of it as 0 to 65,535, though the full range is even larger), so that every character (and punctuation mark and symbol, etc.) in every language (or at least, almost every language) can be represented by a common set of numbers. In this standard, the character é is always character 233, no matter whether it is generated on a Mac, Windows, or Linux computer, and no matter whether it is generated by a computer in the U.S., or Spain, or Russia. Characters in Cyrillic, or Arabic, or Japanese, etc., each have their own unique values. Now the only question is how to put those values into files.
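
Again using Python purely for illustration, you can see that Unicode assigns one fixed number to each character, no matter where the text came from:

    # Unicode assigns é the number 233 (written U+00E9) on every platform.
    print(ord("é"))             # prints: 233
    print(ord("ñ"), ord("Я"))   # prints: 241 1071  (Cyrillic Я has its own unique number)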

There are several methods of doing so, but the two most common are called UTF-8 and UTF-16, the latter sometimes being referred to simply as Unicode. UTF-16 uses one 16-bit word to hold a character (two words for a few rarely used characters), whereas UTF-8 uses a variable scheme of one to four bytes (8 to 32 bits) to represent a character. UTF-16 is obviously more straightforward, but for Western languages using the Latin alphabet, it will always take up more space. UTF-8 is more complicated (although that's the computer's problem, not yours), but it has one huge advantage – if your file has no accented characters, and no characters in non-Latin alphabets, a UTF-8 file is 100% identical to a simple ASCII file of the type that has been in use (in the West, at least) for decades. UTF-8 has one more advantage as well, which is that current versions of all web browsers are capable of recognizing web pages which use UTF-8 encoding, and hence capable of displaying multiple languages on the same page, like this: 互可丕丗 (no, we have no idea what that means!).
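
Here is a small Python illustration of the difference; the byte values shown are simply what the two encodings produce for the same text:

    # The same text, encoded two ways.
    print("Hello".encode("utf-8"))      # b'Hello' - identical to a plain ASCII file
    print("Hello".encode("utf-16-be"))  # b'\x00H\x00e\x00l\x00l\x00o' - twice the size
    print("café".encode("utf-8"))       # b'caf\xc3\xa9' - only the é needs extra bytes (C3 A9)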

How does this apply to iPhone software from Stevens Creek Software? Our standard is that for output from the iPhone, we use UTF-8 encoding. If there are no accented characters or non-Latin characters, you'll be able to open, read, and modify the file using any program capable of simple text editing, because it will be identical to a simple ASCII file. If there are accented characters or non-Latin characters, then you must open it (and modify it) using a program capable of handling UTF-8 files. Current versions of TextEdit on the Macintosh (the standard simple text editor included on all Macs) and Notepad (but not WordPad) on Windows (the standard simple text editor included on all Windows computers) are capable of reading and writing UTF-8 files. In TextEdit (Macintosh), there is a preference you can set (under Open and Save) where you specify the default encoding for files you open and files you save; when you save a file, there is also a pull-down list in the Save dialog where you can specify the encoding of that particular file. Notepad on Windows will open UTF-8 files with no special settings required, and when you save a file from that application, you can choose UTF-8 from the encoding list in the Save dialog.
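
If you're ever unsure whether a particular file falls into the "plain ASCII" category, a few lines of Python can tell you (the file name here is just an example; substitute your own):

    # Check whether a file contains only plain ASCII bytes (the file name is hypothetical).
    data = open("export.txt", "rb").read()
    if all(byte < 128 for byte in data):
        print("Plain ASCII - any simple text editor can open it.")
    else:
        print("Contains accented or non-Latin characters - open it as UTF-8.")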

Most of our iPhone applications will open files which have been saved in ASCII, UTF-8, or UTF-16 format. In general, however, we recommend that if you are preparing files on the computer for transfer to the iPhone, you should stick with UTF-8. If you are saving a file to transfer to our software on the iPhone from an application which does not allow you to specify the encoding, perhaps a database or spreadsheet program, and if the file contains either accented or non-Latin characters, you are most likely saving it in a "codepage" which our software won't be happy with. Before attempting to transfer it into our software, you should open it with either TextEdit (Mac) or Notepad (Windows) and then re-save it as a UTF-8 file. If the file does not have any accented characters or "smart" punctuation, but just plain Latin characters and "simple" punctuation, then saving it as a "plain" file will be fine, because that file will be both ASCII and UTF-8 simultaneously.
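
If you prefer to do that conversion with a small script instead of TextEdit or Notepad, something along these lines works in Python (the file names, and the guess that the exporting program used Windows-1252, are assumptions you would adjust for your own situation):

    # Re-save a legacy-codepage export as UTF-8 (file names and cp1252 are assumptions).
    with open("export_from_spreadsheet.csv", encoding="cp1252") as source:
        contents = source.read()
    with open("export_utf8.csv", "w", encoding="utf-8") as destination:
        destination.write(contents)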

If you need to open a file that has come from one of our iPhone applications with a program which does not understand UTF-8 encoding, but only opens simple ASCII files, the worst thing that will happen is that some of the accented characters will be changed, often into strange two-character combinations. For example, an ñ character might be transformed into this: √±. All the other characters in the file, however, will be unaffected. As an alternative, you can simply open the file first with a program that does understand UTF-8 (TextEdit or Notepad), and then save the file as either MacOS Roman (if you're on a Mac) or Windows Latin 1 (if you're on Windows). Then that file should open properly with the program which can't read UTF-8 files.
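
That √± example isn't random, by the way: UTF-8 stores ñ as two bytes (C3 B1), and a program assuming the old MacOS Roman codepage shows those two bytes as two separate characters, as this Python snippet demonstrates:

    # How ñ becomes √± : the two UTF-8 bytes are shown as two MacOS Roman characters.
    utf8_bytes = "ñ".encode("utf-8")        # b'\xc3\xb1'
    print(utf8_bytes.decode("mac_roman"))   # prints: √±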


PocketTimer Pro © 2008-2014
Version XXX for Android
Stevens Creek Software
www.stevenscreek.com