Surrogates and Combining Characters in Java

Target

Build a new API for the String, StringBuffer and Character classes of the Java SDK that ensures that surrogate pairs and combining characters are preserved (e.g. by never cutting a string in the middle of such a character).

Design ideas

For each affected JDK class build a corresponding wrapper class.
Each wrapper class contains static methods as alternatives to the deprecated original methods.
Each wrapper method takes an object of the corresponding original class as a parameter. The wrapper method works on this object (the object takes the role of the "this" pointer).
There is typically no 1:1 mapping between wrapper and original methods. Depending on the context and the programmer's intention, one of several methods has to be chosen. For example, String.charAt can mean fetching a code unit (16-bit), a Unicode character (32-bit) or a Grapheme (base character + following combining characters).
Often it is necessary to change not only a function call but also the surrounding code. Therefore we scan the existing code for patterns that occur repeatedly and describe how to rewrite them.
We offer higher-level methods that support a whole coding pattern with one wrapper method. This makes it easier to rewrite existing code, documents the intention of the programmer and typically increases performance.

Example

old:

String componentId = id;
int i = id.indexOf('_');
if (i >= 0) {
    componentId = id.substring(0, i);
}

new:
String componentId = Utf16Str.splitBefore(id, '_');
if (componentId == null) {
    componentId = id;
}
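
A minimal sketch of what the wrapper method behind this example might look like. The class name Utf16Str is taken from the example; the method body, the lower-case method name and the convention of returning null when the delimiter is missing are assumptions made for illustration only. A fuller version would presumably also reject a delimiter that is immediately followed by a combining character.

```java
/** Hypothetical sketch of the Utf16Str wrapper class; not an existing library class. */
final class Utf16Str {
    private Utf16Str() {
        // static helpers only
    }

    /**
     * Returns the part of s in front of the first occurrence of delimiter,
     * or null if the delimiter does not occur (assumed convention).
     */
    static String splitBefore(String s, char delimiter) {
        int i = s.indexOf(delimiter);
        return (i >= 0) ? s.substring(0, i) : null;
    }
}
```

With this sketch, Utf16Str.splitBefore("comp_1", '_') yields "comp", and a string without the delimiter yields null, which is why the rewritten code falls back to id in that case.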

Critical classes and methods

class Character: methods dealing with character properties

Example:
boolean Character.isLetter( char c )

Requirement:
In order to handle surrogate pairs properly, an interface for 32-bit characters (encoding UTF-32) is required.

Solution Approach:
The class UCharacter of ICU4J offers such a 32-bit interface and should be used instead of the JDK class Character.
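
The difference between the 16-bit and the 32-bit interface can be sketched as follows. To keep the example runnable without ICU4J, it uses the JDK's own int overload Character.isLetter(int) as a stand-in; with ICU4J on the class path the call would be UCharacter.isLetter(int) instead. The class and method names are hypothetical.

```java
/** Hypothetical helper that checks character properties per code point. */
final class CodePointProperties {
    private CodePointProperties() {
        // static helpers only
    }

    /** Counts letters by code point, so a surrogate pair is tested as one character. */
    static int countLetters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // 32-bit value; joins a surrogate pair
            if (Character.isLetter(cp)) {    // stand-in for UCharacter.isLetter(cp)
                count++;
            }
            i += Character.charCount(cp);    // advance by one or two code units
        }
        return count;
    }
}
```

A supplementary letter such as U+10400 is counted here, whereas Character.isLetter(char) applied to either half of its surrogate pair returns false.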

class String/StringBuffer: extract single characters from a string

Example:
char String.charAt( int index )

Requirement:
A 16-bit return value is problematic if the 16-bit value is part of a surrogate pair or part of a combining character sequence.

Solution Approach (depending on the programming context):

continue working on 16-bit code units (the following operations do not destroy surrogate pairs or composite character sequences)
continue working on 16-bit code units, but skip all "complex characters" (surrogate pairs and composite character sequences)
work on 32-bit characters (e.g. this is necessary to check character properties)
work on Graphemes represented as strings

These alternatives can be offered via static methods and/or via a character iterator class.
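
As an illustration of the last two alternatives, the sketch below splits a string into 32-bit characters and into Graphemes. It assumes that the JDK's java.text.BreakIterator character instance is an acceptable approximation of Grapheme boundaries; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper offering code-point and Grapheme views of a string. */
final class GraphemeSplitter {
    private GraphemeSplitter() {
        // static helpers only
    }

    /** Returns the string as 32-bit characters (surrogate pairs joined). */
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    /** Returns each Grapheme (base character plus combining characters) as a string. */
    static List<String> graphemes(String s) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(s.substring(start, end));
        }
        return result;
    }
}
```

For a string made of "e" plus the combining acute accent U+0301, followed by "x" and one supplementary character, graphemes returns three strings even though the input has five 16-bit code units.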

class String: searching

Example:
int String.indexOf( char c )

Requirement:
When a matching character or string is found, the character that immediately follows the match has to be checked as well. If the matching sequence is immediately followed by a combining character, then it is not a valid match, because the combining character modifies the last character of the matching sequence.
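
A minimal sketch of such a search, assuming that "combining character" can be approximated by the Unicode general categories Mn, Mc and Me as reported by Character.getType. The class and method names are hypothetical, and a production version would probably rely on proper Grapheme boundary rules instead.

```java
/** Hypothetical search helper that rejects matches followed by a combining character. */
final class CombiningAwareSearch {
    private CombiningAwareSearch() {
        // static helpers only
    }

    /** Approximation: treats general categories Mn, Mc and Me as combining. */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Like String.indexOf(char), but skips hits that are modified by a following mark. */
    static int indexOf(String s, char c) {
        int i = s.indexOf(c);
        while (i >= 0) {
            int next = i + 1;
            if (next >= s.length() || !isCombiningMark(s.codePointAt(next))) {
                return i;                 // stand-alone occurrence
            }
            i = s.indexOf(c, next);       // hit was a base of a combining sequence
        }
        return -1;
    }
}
```

Searching for 'e' in a string that starts with "e" plus U+0301 and ends with a plain "e" skips the accented occurrence and reports the index of the plain one.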

class String/StringBuffer: extracting parts of a string

Example:
String String.substring( int beginIndex, int endIndex)

Requirement:
When extracting parts from a string, avoid splitting surrogate pairs and Graphemes.

Solution Approach (depending on the programming context):

combine extraction with a preceding search operation (e.g. splitBefore, splitAfter, cutPrefix ...)
drop surrogate pairs and Graphemes that would be split from the result entirely (e.g. when storing strings in a buffer with limited size)
keep index access if the string has a fixed format
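
The limited-buffer case can be sketched like this. The example assumes that java.text.BreakIterator's character instance is a sufficient stand-in for Grapheme boundaries; the class and method names are hypothetical.

```java
import java.text.BreakIterator;

/** Hypothetical helper that shortens a string without splitting a character. */
final class SafeTruncate {
    private SafeTruncate() {
        // static helpers only
    }

    /**
     * Returns at most maxCodeUnits code units of s. A surrogate pair or
     * Grapheme that would be cut at the limit is dropped completely.
     */
    static String truncate(String s, int maxCodeUnits) {
        if (maxCodeUnits < 0) {
            throw new IllegalArgumentException("maxCodeUnits must not be negative");
        }
        if (s.length() <= maxCodeUnits) {
            return s;
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        // Use the limit itself if it is a boundary, otherwise the boundary before it.
        int end = it.isBoundary(maxCodeUnits) ? maxCodeUnits : it.preceding(maxCodeUnits);
        return s.substring(0, end);
    }
}
```

Truncating "ab" followed by one supplementary character to three code units returns "ab", because the fourth code unit is needed to complete the surrogate pair.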

class StringBuffer: modifying parts of a string

Example:
StringBuffer StringBuffer.replace( int beginIndex, int endIndex, String s )

The indices that mark the borders of the operation must not cut surrogate pairs or Graphemes. In principle, the same approaches can be applied as for extracting parts of a string.
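
One possible policy for a modifying wrapper is to reject indices that fall inside a character instead of silently adjusting them. The sketch below shows this under the assumption that java.text.BreakIterator approximates Grapheme boundaries well enough; the class name, the method name and the choice to throw an exception are illustrative only.

```java
import java.text.BreakIterator;

/** Hypothetical wrapper around StringBuffer.replace that validates its indices. */
final class SafeReplace {
    private SafeReplace() {
        // static helpers only
    }

    /** Replaces [begin, end) with s, but only if both indices are character boundaries. */
    static StringBuffer replace(StringBuffer sb, int begin, int end, String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(sb.toString());
        if (!it.isBoundary(begin) || !it.isBoundary(end)) {
            throw new IllegalArgumentException(
                    "index cuts a surrogate pair or Grapheme: " + begin + ", " + end);
        }
        return sb.replace(begin, end, s);
    }
}
```

Replacing the range that covers "e" plus U+0301 as a whole succeeds, while a range that ends between the base character and its accent is rejected.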

Rules and restrictions on strings that can be processed

Strings must not contain unpaired surrogates. Unpaired surrogates may be skipped or replaced with another character at any time.
In accordance with the W3C character model, strings must be normalized early. That means that we normalize strings immediately when they are entered by the user, and we assume that strings are already normalized when we receive them from other software. We use Normalization Form C (canonical decomposition followed by canonical composition). Searching, sorting and identity matching may not work with unnormalized strings.
Strings must not start with a combining character. Since concatenation with such strings may result in unnormalized strings, searching, sorting and identity matching may not work.
Identifiers shall not contain format characters (e.g. to indicate writing direction from right to left). Format characters are not ignored when searching for identifiers.
In contexts where a character must be quoted because it is used as a delimiter, this character must also be quoted if it is used as the base character of a composite character sequence.
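
The first three rules can be checked mechanically. The following sketch uses the JDK class java.text.Normalizer for Normalization Form C and approximates "combining character" by the general categories Mn, Mc and Me; the class and method names are hypothetical, and the rules on format characters and quoting are not covered.

```java
import java.text.Normalizer;

/** Hypothetical checks for the string rules listed above. */
final class StringRules {
    private StringRules() {
        // static helpers only
    }

    /** Normalizes user input to Normalization Form C as early as possible. */
    static String normalizeEarly(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    /** True if s contains a high or low surrogate that is not part of a pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++;            // valid pair: skip the low surrogate
                } else {
                    return true;    // high surrogate without a partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;        // low surrogate without a preceding high one
            }
        }
        return false;
    }

    /** Approximation: general categories Mn, Mc and Me count as combining. */
    static boolean startsWithCombiningMark(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int type = Character.getType(s.codePointAt(0));
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** True if s satisfies the three rules checked by this sketch. */
    static boolean isProcessable(String s) {
        return !hasUnpairedSurrogate(s)
                && !startsWithCombiningMark(s)
                && Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }
}
```

For example, "e" followed by U+0301 is not in Normalization Form C and is rejected by isProcessable, while the result of normalizeEarly on the same input is the single precomposed character U+00E9 and is accepted.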

Possible support by check tools

A check tool can detect the critical operations and issue a warning whenever one of them is used.

It may be possible to avoid warnings if the critical operation is used in a safe context. This can be:

extracting a part of a string with indices that come from previous search or string-length operations
searching for a character which does not permit combining characters (e.g. \n)
extracting a character as a 16-bit value from a string and comparing this character to other characters that are neither surrogate pairs nor can be followed by combining characters