jakub holý

building the right thing, building it right, fast

Truncating UTF String to the given number of bytes while preserving its validity [for DB insert]

2007-11-02 08:03:46Databases

Often you need to insert a String from Java into a database column with a fixed length specified in bytes. Using
string.substring(0, DB_FIELD_LENGTH);
isn't enough because it only cuts down the number of characters but in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes while only its 1st byte fits into the field). There are two solutions for truncation the string in such a way, that it has at most DB_FIELD_LENGTH bytes and is a valid UTF-8 string.

Approach 1: Replace the invalid trailing byte(s) with a 'rectangle'

This is as simple as:
int maxLen = DB_FIELD_LENGTH-2;
string = new String( string.getBytes("UTF-8") , 0, maxLen, "UTF-8");
The new String constructor will automatically replace any invalid character (i.e. incomplete utf-8 char; we may only have one at the end) with the character \uFFFD, which looks like an empty rectangle. This character requires 3 bytes in utf-8 - therefore we decrease DB_FIELD_LENGTH by 2; the resulting string will have either exactly maxLen bytes if its last byte(s) is a valid utf-8 character or maxLen+2 bytes if it isn't valid and this 1 byte was replaced by \uFFFD (3B).

Approach 2: Skip the invalid trailing byte(s) altogether

If you don't want to have the rectangle character in the place of a split multibyte character, you must do yourself what the String constructor does internally, in a bit different way:
import java.nio.*; import java.nio.charset.*;
Charset utf8Charset = Charset.forName("UTF-8");
CharsetDecoder cd = utf8Charset.newDecoder();
byte[] sba = string.getBytes("UTF-8");
// Ensure truncating by having byte buffer = DB_FIELD_LENGTH
ByteBuffer bb = ByteBuffer.wrap(sba, 0, DB_FIELD_LENGTH); // len in [B]
CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // len in [char] <= # [B]
// Ignore an incomplete character
cd.decode(bb, cb, true);
string = new String(cb.array(), 0, cb.position());
The string will end with the last valid character in the given range.

Approach 3: Manually remove the invalid trailing bytes

As you can see, the approach 2 requires quite lot of coding and method calls. If you know the details of UTF-8, namely how to distinguish an invalid byte or byte sequence then you can simply truncate the byte array and then remove/replace the invalid bytes. I'd be glad for the code :-)