How long is your String object?

From Sun's newsletter:

How long is your text string? You might need to know that answer to check whether user input conforms to data field length constraints. Database text fields usually make you constrain entries to a specific length, so you might need to confirm text length before submitting it. Whatever the reason, we all occasionally need to know the length of a text field. Many programmers use a String object’s length method to get that information. In many situations, the length method provides the right solution. However, this isn’t the only way to determine a String object’s length, and it’s not always the correct way either.

You have at least three common ways to measure text length in the Java platform:

  1. number of char code units
  2. number of characters or code points
  3. number of bytes

Counting char Units

The Java platform uses the Unicode Standard to define its characters. The Unicode Standard once defined characters as fixed-width, 16-bit values in the range U+0000 through U+FFFF. The U+ prefix signifies a valid Unicode character value as a hexadecimal number. The Java language conveniently adopted the fixed-width standard for the char type. Thus, a char value could represent any 16-bit Unicode character.

Most programmers are familiar with the length method. The following code counts the number of char values in a sample string. Notice that the sample String object contains a few simple characters and several characters defined with the Java language's \u notation. The \u notation defines a 16-bit char value as a hexadecimal number and is similar to the U+ notation used by the Unicode Standard.

// Four ASCII letters, one CJK character, and one surrogate pair
String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
System.out.printf("char count: %d\n", charCount);

The length method counts the number of char values in a String object. The sample code prints this:

char count: 7

Counting Character Units

When Unicode version 4.0 defined a significant number of new characters above U+FFFF, the 16-bit char type could no longer represent all characters. Starting with the Java 2 Platform, Standard Edition 5.0 (J2SE 5.0), the Java platform began to support each of the new Unicode characters as a pair of 16-bit char values called a surrogate pair. Two char units act as a surrogate representation of a Unicode character in the range U+10000 through U+10FFFF. Characters in this new range are called supplementary characters.

Although a single char value can still represent a Unicode value up to U+FFFF, only a char surrogate pair can represent supplementary characters. The leading or high value of the pair is in the U+D800 through U+DBFF range. The trailing or low value is in the U+DC00 through U+DFFF range. The Unicode Standard allocates these two ranges for special use in surrogate pairs. The standard also defines an algorithm for mapping between a surrogate pair and a character value above U+FFFF. Using surrogate pairs, programmers can represent any character in the Unicode Standard. This special use of 16-bit units is called UTF-16, and the Java Platform uses UTF-16 to represent Unicode characters. The char type is now a UTF-16 code unit, not necessarily a complete Unicode character (code point).
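The mapping between a surrogate pair and a supplementary code point can be sketched in a few lines. The class and method names below (SurrogatePairSketch, decode, encode) are hypothetical and exist only for illustration; the arithmetic follows the UTF-16 definition, and the Character.toCodePoint and Character.toChars calls from the J2SE 5.0 API are shown as a cross-check.

```java
import java.util.Arrays;

final class SurrogatePairSketch {

    // Combine a high and a low surrogate into one supplementary code point.
    static int decode(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Split a supplementary code point (above U+FFFF) into its surrogate pair.
    static char[] encode(int codePoint) {
        int offset = codePoint - 0x10000;                // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = decode('\uD800', '\uDF30');
        System.out.printf("decoded: U+%X%n", codePoint);

        char[] pair = encode(0x10330);
        System.out.printf("encoded: %04X %04X%n", (int) pair[0], (int) pair[1]);

        // The library methods should agree with the hand-written arithmetic.
        System.out.println(codePoint == Character.toCodePoint('\uD800', '\uDF30'));
        System.out.println(Arrays.equals(pair, Character.toChars(0x10330)));
    }
}
```

In application code, the library methods are the better choice; the manual version is here only to show what the algorithm does.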

The length method cannot count supplementary characters since it only counts char units. Fortunately, the J2SE 5.0 API has a new String method: codePointCount(int beginIndex, int endIndex). This method tells you how many Unicode code points (characters) are between the two indices. The index values refer to code unit or char locations. The value of the expression endIndex - beginIndex is the number of char units in that range; for the whole string, it is the same value provided by the length method. This difference is not always the same as the value returned by the codePointCount method. If your text contains surrogate pairs, the two counts are definitely different. A single code point occupies either one or two char units, and a surrogate pair always defines exactly one code point.

To find out how many Unicode character code points are in a string, use the codePointCount method:

String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
// Count code points across the whole string (indices are char positions)
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);

This example prints this:

character count: 6

The testString variable contains two interesting characters, which are a Japanese character meaning “learning” and a character named GOTHIC LETTER AHSA. The Japanese character has Unicode code point U+5B66, which has the same hexadecimal char value \u5B66. The Gothic letter’s code point is U+10330. In UTF-16, the Gothic letter is the surrogate pair \uD800\uDF30. The pair represents a single Unicode code point, and so the character code point count of the entire string is 6 instead of 7.
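To see where the difference between 7 and 6 comes from, it can help to walk the string one code point at a time. The following is a minimal sketch, assuming a hypothetical helper class named CodePointWalk; it uses only String.codePointAt and Character.charCount, both part of the J2SE 5.0 API.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {

    // Collect every code point in the text, advancing one or two char units per step.
    static List<Integer> codePoints(String text) {
        List<Integer> result = new ArrayList<Integer>();
        int index = 0;
        while (index < text.length()) {
            int codePoint = text.codePointAt(index);
            result.add(codePoint);
            // charCount returns 2 for supplementary code points, otherwise 1.
            index += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";
        for (int codePoint : codePoints(testString)) {
            System.out.printf("U+%04X uses %d char unit(s)%n",
                    codePoint, Character.charCount(codePoint));
        }
    }
}
```

Only the last entry, U+10330, reports two char units, which accounts for the single unit of difference between length and codePointCount.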

Counting Bytes

How many bytes are in a String? The answer depends on the byte-oriented character set encoding used. One common reason for asking “how many bytes?” is to make sure you’re satisfying string length constraints in a database. The getBytes method converts its Unicode characters into a byte-oriented encoding, and it returns a byte[]. One byte-oriented encoding is UTF-8, which is unlike most other byte-oriented encodings since it can accurately represent all Unicode code points.

The following code converts text into an array of byte values:

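// Requires: import java.io.UnsupportedEncodingException;
byte[] utf8 = null;
int byteCount = 0;
try {
    // Encode the sample string from the earlier examples as UTF-8
    utf8 = testString.getBytes("UTF-8");
    byteCount = utf8.length;
} catch (UnsupportedEncodingException ex) {
    // Not expected: every Java platform implementation must support UTF-8
    ex.printStackTrace();
}
System.out.printf("UTF-8 Byte Count: %d\n", byteCount);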

The target character set determines how many bytes are generated. The UTF-8 encoding transforms a single Unicode code point into one to four 8-bit code units (bytes). The characters a, b, c, and d require a total of only four bytes. The Japanese character turns into three bytes. The Gothic letter takes four bytes. The total result is shown here:

UTF-8 Byte Count: 11
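Because the byte count depends on the target encoding, the same string can be measured against several character sets. The sketch below assumes a newer platform than the one this tip was written for: it uses java.nio.charset.StandardCharsets, which was added in Java 7, and the class and method names (ByteCountSketch, byteCount, roundTrips) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteCountSketch {

    // Number of bytes the text occupies in the given encoding.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // True if encoding and then decoding gives back the original text.
    static boolean roundTrips(String text, Charset charset) {
        return new String(text.getBytes(charset), charset).equals(text);
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";

        // UTF-8: 4 one-byte letters + 3 bytes + 4 bytes
        System.out.println("UTF-8:    " + byteCount(testString, StandardCharsets.UTF_8));
        // UTF-16BE: 2 bytes for each of the 7 char units
        System.out.println("UTF-16BE: " + byteCount(testString, StandardCharsets.UTF_16BE));

        // ISO-8859-1 cannot represent the last two characters, so data is lost.
        System.out.println("UTF-8 round trip:      "
                + roundTrips(testString, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 round trip: "
                + roundTrips(testString, StandardCharsets.ISO_8859_1));
    }
}
```

The getBytes(Charset) overload does not throw UnsupportedEncodingException, so no try/catch is needed. It silently replaces characters the target encoding cannot represent, which is why a byte count from a lossy encoding should be treated with caution.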

Figure 1. Strings have varying lengths depending on what you count.

Summary

Unless you use supplementary characters, you will never see a difference between the return values of length and codePointCount. However, as soon as you use characters above U+FFFF, you’ll be glad to know about the different ways to determine length. If you send your products to China or Japan, you’re almost certain to find a situation in which length and codePointCount return different values. Database character set encodings and some serialization formats encourage UTF-8 as a best practice. In that case, the text length measurement is different yet again. Depending on how you intend to use length, you have a variety of options for measuring it.
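The choice of unit also matters when text has to be shortened to fit a length constraint. The following is a minimal sketch, assuming a limit expressed in characters (code points); the class and method names (LengthLimitSketch, truncateToCodePoints) are hypothetical. It relies on String.offsetByCodePoints from the J2SE 5.0 API so that a surrogate pair is never cut in half.

```java
final class LengthLimitSketch {

    // Keep at most maxCodePoints characters without splitting a surrogate pair.
    static String truncateToCodePoints(String text, int maxCodePoints) {
        int total = text.codePointCount(0, text.length());
        if (total <= maxCodePoints) {
            return text;
        }
        // Translate a count of code points into a char index.
        int endIndex = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, endIndex);
    }

    public static void main(String[] args) {
        String text = "\uD800\uDF30abc";

        // A plain substring counts char units and leaves a lone high surrogate.
        String naive = text.substring(0, 1);
        System.out.println(Character.isHighSurrogate(naive.charAt(0)));

        // The code point aware version keeps the whole pair.
        String safe = truncateToCodePoints(text, 1);
        System.out.println(safe.length() + " char units, "
                + safe.codePointCount(0, safe.length()) + " code point");
    }
}
```

A limit expressed in bytes would need a different approach, since the byte count depends on the target encoding.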
