java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence

Hi,

I have taken a short break! There are many more error messages that keep accumulating behind the scenes, but I have been too lazy to put them up on grassfield.

Anyway, today I got interrupted by an interesting exception

java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence

The scenario: I was trying to parse an XML string. I took the byte array from the XML string and gave that array as input to the XML reader stream, using java.lang.String.getBytes() for the conversion.

Unfortunately, one of the nodes in the XML had Chinese (or some other non-Western) characters as its value. Oops, I ended up with the above error. Later I found that the no-argument getBytes() method encodes using the platform's default charset (a Western encoding on my machine), not UTF-8. By switching to java.lang.String.getBytes("UTF-8"), we solved the issue! Nice, na!
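
Here is a minimal sketch of the fix (the class name, the XML snippet and the node value are just illustrative, not my real code):

import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.Document;

public class xmltest
{
    public static void main(String [] args) throws Exception
    {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>学习</name>";
        // getBytes() with no argument uses the platform default charset,
        // which can mangle non-Western characters; name UTF-8 explicitly
        byte[] bytes = xml.getBytes("UTF-8");
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}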


Fun with Text: java.util.Scanner


One of my friends came in with Sun's newsletter this morning. I was intrigued by their demo of a new class, java.util.Scanner.
See, parsing a string becomes very simple, like iterating over a list.

Scanner accepts streams, files and other string input mechanisms, parses the text and gives us the tokens. (It also lets you specify which encoding the text was written in; good news for localisation guys like me.) By default, whatever you give it is tokenized on the default delimiter, whitespace. See the following example,

import java.util.*;
import java.io.*;
public class test
{
    public static void main(String [] args) throws FileNotFoundException
    {
        File f = new File("test.java");
        Scanner scanner = new Scanner(f);
        while (scanner.hasNext())
        {
            System.out.println(scanner.next());
        }
        scanner.close();
    }
}

The output is

C:\>java test
import
java.util.*;
import
java.io.*;
public
class
test
{
public
static
void
main(String
[]
args)
throws
FileNotFoundException
{
File
f
=
new
File("test.java");
Scanner
scanner
=
new
Scanner(f);
while
(scanner.hasNext())
{
System.out.println(scanner.next());
}
scanner.close();
}
}

Funny, isn't it!

We can also change the delimiter; see the following example.

import java.util.*;
import java.io.*;
public class test
{
    public static void main(String [] args) throws FileNotFoundException
    {
        File f = new File("test.java");
        Scanner scanner = new Scanner(f);
        scanner.useDelimiter("\n");
        while (scanner.hasNext())
        {
            System.out.println(scanner.next());
        }
        scanner.close();
    }
}

This time the delimiter is a newline, so each line of the file comes out as a single token:

C:\>java test
import java.util.*;
import java.io.*;
public class test
{
    public static void main(String [] args) throws FileNotFoundException
    {
        File f = new File("test.java");
        Scanner scanner = new Scanner(f);
        scanner.useDelimiter("\n");
        while (scanner.hasNext())
        {
            System.out.println(scanner.next());
        }
        scanner.close();
    }
}

Really good one! But I really miss the fun of using the streams 😦
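
Since encoding matters to localisation folks, here is a small sketch (the file name "tamil.txt" and class name are just placeholders) that tells Scanner which charset the file was saved in:

import java.util.*;
import java.io.*;
public class encodingtest
{
    public static void main(String [] args) throws FileNotFoundException
    {
        // the second argument names the charset the file was written in
        Scanner scanner = new Scanner(new File("tamil.txt"), "UTF-8");
        while (scanner.hasNextLine())
        {
            System.out.println(scanner.nextLine());
        }
        scanner.close();
    }
}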

BOM – Byte Order Mark

Hi,
When files are saved from Notepad-like MS applications, a byte order mark (BOM) may be added at the very beginning of the file. Editors handle it for you, so you can't see it, but it can put you in trouble when you read those files programmatically (an INI file, or any other static file). Either use the *writeUTF()* and *readUTF()* methods of the Data I/O streams, or check for the character *(char) 65279* at the beginning of your file (that value applies when you read the file as UTF-8; for other Unicode encodings the raw bytes are different).
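
For the second approach, here is a rough sketch (the file name "config.ini" is only an example) that skips the BOM character if it is present:

import java.io.*;
public class bomtest
{
    public static void main(String [] args) throws IOException
    {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream("config.ini"), "UTF-8"));
        reader.mark(1);
        int first = reader.read();
        if (first != 0xFEFF) // 65279: the BOM as decoded from UTF-8
        {
            reader.reset(); // no BOM, rewind to the start of the file
        }
        String line;
        while ((line = reader.readLine()) != null)
        {
            System.out.println(line);
        }
        reader.close();
    }
}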

For more info, please search for "byte order mark" on Google 🙂

You may know this already: when you open a file containing Unicode characters, they may be displayed as question marks or junk characters in EditPlus, TextPad or other non-Unicode-ready editors. Notepad and WordPad handle them fine.

—————————————————
*Free* software is a matter of liberty not price. You should think of “free” as in “free speech”.

How to display Tamil text with Java 5 (applets and Java Swing based applications)

Java 1.3+ should display Tamil without altering anything, but the JRE is
not officially configured for Tamil yet (Devanagari has been added).
So you will see only boxes in applets and other Swing applications.
Here is the procedure to see Tamil characters.

It is assumed that you have the Latha font installed.

Open the Java directory, then go into the 'jre' directory. Copy and paste the
"fontconfig.properties.src" file, and rename the copy to
"fontconfig.properties". This file is then the one Java will use
by default. Now open the "fontconfig.properties" file.

Find this line :

# Component Font Mappings

Then add :

allfonts.tamil=Latha

You must then find

sequence.allfonts=alphabetic/default,dingbats,symbol

and append tamil (I am not sure whether this is really required; it
worked without this entry):

sequence.allfonts=alphabetic/default,dingbats,symbol,tamil

add the line

sequence.allfonts.UTF-8.ta=alphabetic/1252,tamil,dingbats,symbol
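
Putting those edits together, the relevant part of fontconfig.properties would look roughly like this (just a sketch; the exact neighbouring lines differ between JRE versions):

# Component Font Mappings
allfonts.tamil=Latha

sequence.allfonts=alphabetic/default,dingbats,symbol,tamil
sequence.allfonts.UTF-8.ta=alphabetic/1252,tamil,dingbats,symbol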

Restart any Java applications you are running, and restart the
browser (deleting temporary files may help if the change doesn't reflect).

Post a message if anything didn't work as expected.

-p.

How long is your String object?

From Sun's newsletter:

How long is your text string? You might need to know that answer to check whether user input conforms to data field length constraints. Database text fields usually make you constrain entries to a specific length, so you might need to confirm text length before submitting it. Whatever the reason, we all occasionally need to know the length of a text field. Many programmers use a String object’s length method to get that information. In many situations, the length method provides the right solution. However, this isn’t the only way to determine a String object’s length, and it’s not always the correct way either.

You have at least three common ways to measure text length in the Java platform:

  1. number of char code units
  2. number of characters or code points
  3. number of bytes

Counting char Units

The Java platform uses the Unicode Standard to define its characters. The Unicode Standard once defined characters as fixed-width, 16-bit values in the range U+0000 through U+FFFF. The U+ prefix signifies a valid Unicode character value as a hexadecimal number. The Java language conveniently adopted the fixed-width standard for the char type. Thus, a char value could represent any 16-bit Unicode character.

Most programmers are familiar with the length method. The following code counts the number of char values in a sample string. Notice that the sample String object contains a few simple characters and several characters defined with the Java language's \u notation. The \u notation defines a 16-bit char value as a hexadecimal number and is similar to the U+ notation used by the Unicode Standard.

private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
System.out.printf("char count: %d%n", charCount);

The length method counts the number of char values in a String object. The sample code prints this:

char count: 7

Counting Character Units

When Unicode version 4.0 defined a significant number of new characters above U+FFFF, the 16-bit char type could no longer represent all characters. Starting with the Java 2 Platform, Standard Edition 5.0 (J2SE 5.0), the Java platform began to support the new Unicode characters as pairs of 16-bit char values called a surrogate pair. Two char units act as a surrogate representation of Unicode characters in the range U+10000 through U+10FFFF. Characters in this new range are called supplementary characters.

Although a single char value can still represent a Unicode value up to U+FFFF, only a char surrogate pair can represent supplementary characters. The leading or high value of the pair is in the U+D800 through U+DBFF range. The trailing or low value is in the U+DC00 through U+DFFF range. The Unicode Standard allocates these two ranges for special use in surrogate pairs. The standard also defines an algorithm for mapping between a surrogate pair and a character value above U+FFFF. Using surrogate pairs, programmers can represent any character in the Unicode Standard. This special use of 16-bit units is called UTF-16, and the Java Platform uses UTF-16 to represent Unicode characters. The char type is now a UTF-16 code unit, not necessarily a complete Unicode character (code point).
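
A tiny sketch of that mapping (not from the original tip; Character.toChars is available since J2SE 5.0):

// Character.toChars applies the mapping algorithm for you: a code point above
// U+FFFF becomes a two-element char array, i.e. the surrogate pair
char[] pair = Character.toChars(0x10330); // GOTHIC LETTER AHSA
System.out.printf("high: U+%04X, low: U+%04X%n", (int) pair[0], (int) pair[1]);
// prints: high: U+D800, low: U+DF30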

The length method cannot count supplementary characters correctly since it only counts char units. Fortunately, the J2SE 5.0 API has a new String method: codePointCount(int beginIndex, int endIndex). This method tells you how many Unicode code points (characters) lie between the two indices. The index values refer to code unit or char locations. For beginIndex 0 and endIndex length(), the expression endIndex - beginIndex gives the same value as the length method, but that is not always the same as the value returned by codePointCount. If your text contains surrogate pairs, the two counts are definitely different: a surrogate pair uses two char units but defines a single character code point.

To find out how many Unicode character code points are in a string, use the codePointCount method:

private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d%n", characterCount);

This example prints this:

character count: 6

The testString variable contains two interesting characters: a Japanese character meaning "learning" and a character named GOTHIC LETTER AHSA. The Japanese character has Unicode code point U+5B66, which is the same hexadecimal value as the char \u5B66. The Gothic letter's code point is U+10330. In UTF-16, the Gothic letter is the surrogate pair \uD800\uDF30. The pair represents a single Unicode code point, and so the character code point count of the entire string is 6 instead of 7.

Counting Bytes

How many bytes are in a String? The answer depends on the byte-oriented character set encoding used. One common reason for asking “how many bytes?” is to make sure you’re satisfying string length constraints in a database. The getBytes method converts its Unicode characters into a byte-oriented encoding, and it returns a byte[]. One byte-oriented encoding is UTF-8, which is unlike most other byte-oriented encodings since it can accurately represent all Unicode code points.

The following code converts text into an array of byte values:

byte[] utf8 = null;
int byteCount = 0;
try {
    // str holds the same sample text as testString above
    utf8 = str.getBytes("UTF-8");
    byteCount = utf8.length;
} catch (UnsupportedEncodingException ex) {
    ex.printStackTrace();
}
System.out.printf("UTF-8 Byte Count: %d%n", byteCount);

The target character set determines how many bytes are generated. The UTF-8 encoding transforms a single Unicode code point into one to four 8-bit code units (bytes). The characters a, b, c, and d require a total of only four bytes. The Japanese character turns into three bytes. The Gothic letter takes four bytes. The total result is shown here:

UTF-8 Byte Count: 11

Figure 1. Strings have varying lengths depending on what you count.

Summary

Unless you use supplementary characters, you will never see a difference between the return values of length and codePointCount. However, as soon as you use characters above U+FFFF, you'll be glad to know about the different ways to determine length. If you send your products to China or Japan, you're almost certain to find a situation in which length and codePointCount return different values. Database character set encodings and some serialization formats encourage UTF-8 as a best practice. In that case, the text length measurement is different yet again. Depending on how you intend to use length, you have a variety of options for measuring it.

More Information

Use the following resources to find more information about the material in this technical tip: