Truncating a UTF-8 String to a given number of bytes while preserving its validity [for DB insert]
Often you need to insert a String from Java into a database column with a fixed length specified in bytes.
Using
    string.substring(0, DB_FIELD_LENGTH);
isn't enough, because it only limits the number of characters, while in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements either, because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes of which only the first byte or bytes fit into the field). There are two ways to truncate the string so that it has at most DB_FIELD_LENGTH bytes and is still a valid UTF-8 string.
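To make the character/byte mismatch concrete, here is a small sketch. The class name Utf8ByteLengthDemo, the helper utf8Length and the sample text are made up for illustration; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteLengthDemo {
    /** Number of bytes the string occupies once encoded as UTF-8 (hypothetical helper). */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "naïve €": 'ï' takes 2 bytes and '€' takes 3 bytes in UTF-8
        String s = "na\u00EFve \u20AC";
        System.out.println(s.length());     // 7 chars
        System.out.println(utf8Length(s));  // 10 bytes

        // Cutting to 6 characters still yields 7 bytes, one too many for a 6-byte field
        String cut = s.substring(0, 6);
        System.out.println(utf8Length(cut)); // 7
    }
}
```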
Approach 1: Replace the invalid trailing byte(s) with a 'rectangle'
This is as simple as:
    int maxLen = DB_FIELD_LENGTH - 2; // leave room for a 3-byte \uFFFD
    byte[] bytes = string.getBytes("UTF-8");
    // Math.min avoids an exception when the string is shorter than maxLen bytes
    string = new String(bytes, 0, Math.min(maxLen, bytes.length), "UTF-8");
The String constructor automatically replaces any malformed byte sequence (i.e. an incomplete UTF-8 character; there can be only one, at the very end) with the character \uFFFD, the Unicode replacement character, which is often displayed as an empty rectangle. This character requires 3 bytes in UTF-8, which is why we decrease DB_FIELD_LENGTH by 2: the resulting string has exactly maxLen bytes if the cut falls on a character boundary, and at most maxLen+2 bytes if it doesn't, because the 1 to 3 leftover bytes of the split character are replaced by a single \uFFFD (3B).
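The replacement approach can be wrapped in a reusable method. What follows is a minimal sketch rather than a definitive implementation: the class and method names are hypothetical, and it assumes the decoder emits a single \uFFFD for a sequence cut off at the end of the input, which is how current JDKs behave. As a small addition, it returns strings that already fit unchanged.

```java
import java.nio.charset.StandardCharsets;

class ReplacementTruncation {
    /**
     * Returns s cut down to at most maxBytes UTF-8 bytes. A character split by
     * the cut shows up as \uFFFD at the end of the result.
     */
    static String truncateWithReplacement(String s, int maxBytes) {
        if (maxBytes < 2) {
            throw new IllegalArgumentException("maxBytes must be at least 2");
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s; // already fits, nothing to cut
        }
        // Cut 2 bytes early: the leftover bytes of a split character become
        // one \uFFFD (3 bytes), so the result never exceeds maxBytes.
        return new String(bytes, 0, maxBytes - 2, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "aaa€b" is 7 bytes; the cut at byte 4 splits the 3-byte '€'
        String result = truncateWithReplacement("aaa\u20ACb", 6);
        System.out.println(result.equals("aaa\uFFFD"));                       // true
        System.out.println(result.getBytes(StandardCharsets.UTF_8).length);   // 6
    }
}
```

Using StandardCharsets.UTF_8 instead of the charset name "UTF-8" also avoids the checked UnsupportedEncodingException.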
Approach 2: Skip the invalid trailing byte(s) altogether
If you don't want to have the rectangle character in place of a split multibyte character, you must do yourself what the String constructor does internally, in a slightly different way:
    import java.nio.*;
    import java.nio.charset.*;

    Charset utf8Charset = Charset.forName("UTF-8");
    CharsetDecoder cd = utf8Charset.newDecoder();
    byte[] sba = string.getBytes("UTF-8");
    // Ensure truncation by wrapping at most DB_FIELD_LENGTH bytes
    // (Math.min avoids an exception when the string is shorter than that)
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, Math.min(DB_FIELD_LENGTH, sba.length)); // len in [B]
    CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // len in [char] <= # [B]
    // Ignore an incomplete character
    cd.onMalformedInput(CodingErrorAction.IGNORE);
    cd.decode(bb, cb, true);
    cd.flush(cb);
    string = new String(cb.array(), 0, cb.position());
The string will end with the last complete character that fits in the given range.
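The decoder-based approach can likewise be packaged as a method and checked against a few inputs. This is a sketch under the same assumptions as the snippet above; the class name SkippingTruncation and the method name truncateToUtf8Bytes are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SkippingTruncation {
    /** Returns s cut down to at most maxBytes UTF-8 bytes, dropping a split trailing character. */
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s; // already fits, nothing to cut
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE); // silently drop the incomplete tail
        ByteBuffer in = ByteBuffer.wrap(bytes, 0, maxBytes);
        // A UTF-8 sequence never decodes to more chars than it has bytes
        CharBuffer out = CharBuffer.allocate(maxBytes);
        decoder.decode(in, out, true); // true: no more input follows
        decoder.flush(out);
        return new String(out.array(), 0, out.position());
    }

    public static void main(String[] args) {
        // "aaa€b" is 7 bytes; a 5-byte limit splits the 3-byte '€', which is dropped
        System.out.println(truncateToUtf8Bytes("aaa\u20ACb", 5)); // aaa
        // With 6 bytes the whole '€' fits
        System.out.println(truncateToUtf8Bytes("aaa\u20ACb", 6).equals("aaa\u20AC")); // true
    }
}
```

Unlike Approach 1, this needs no 2-byte safety margin, because nothing is ever added in place of the dropped bytes.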