Surrogates and Combining Characters in Java
Target
Build a new API for the String, StringBuffer and Character classes of the Java
SDK that ensures that surrogates and combining characters are preserved (e.g.
avoid cutting a string within such a character).
Design ideas
- For each affected JDK class build a corresponding wrapper class.
- The wrapper class contains static methods as alternatives to the deprecated
original methods.
- Each wrapper method takes an object of the corresponding original class as a
parameter. The wrapper method works on this object (this object takes the role
of the "this" pointer).
- There is typically no 1:1 mapping between wrapper and original method.
Depending on context and the programmer's intention, one of several methods has
to be chosen. For example, String.charAt can mean: fetch a code unit (16-bit), a
Unicode character (32-bit code point) or a Grapheme (base character + following
combining characters).
- Often it is necessary to change not only a function call, but also the
surrounding code. Therefore we scan through the existing code, search for
patterns that occur repeatedly, and describe how to rewrite them.
- We offer higher-level methods that support a whole coding pattern with one
wrapper method. This makes it easier to rewrite existing code, documents the
intention of the programmer and typically increases performance.
Example
old:
    String componentId = id;
    int i = id.indexOf('_');
    if (i >= 0) {
        componentId = id.substring(0, i);
    }
new:
    String componentId = Utf16Str.splitBefore(id, '_');
    if (componentId == null) {
        componentId = id;
    }
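The wrapper method used in the "new" variant could look roughly like the
following. This is only a minimal sketch: the class name Utf16Str comes from the
example above, but the body, the helper isCombiningMarkAt and the choice to skip
a delimiter that is followed by a combining mark are assumptions made for
illustration, not a specified behavior.

```java
// Hypothetical sketch of the wrapper class from the example above.
final class Utf16Str {
    private Utf16Str() {}

    /**
     * Returns the part of s before the first stand-alone occurrence of
     * delimiter, or null if there is none. An occurrence that is directly
     * followed by a combining mark is skipped, because the mark modifies it.
     */
    static String splitBefore(String s, char delimiter) {
        int i = s.indexOf(delimiter);
        while (i >= 0) {
            if (!isCombiningMarkAt(s, i + 1)) {
                return s.substring(0, i);
            }
            i = s.indexOf(delimiter, i + 1);
        }
        return null;
    }

    // Assumed helper: true if a combining mark starts at the given index.
    private static boolean isCombiningMarkAt(String s, int index) {
        if (index >= s.length()) {
            return false;
        }
        int type = Character.getType(s.codePointAt(index));
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }
}
```

The caller never sees an index, so there is nothing it could use to cut the
string at the wrong position.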
Critical classes and methods
class Character: methods dealing with character properties
Example:
boolean Character.isLetter( char c )
Requirement:
In order to handle surrogate pairs properly, an interface for 32-bit
characters (encoding UTF-32) is required.
Solution Approach:
class UCharacter of ICU4J offers such a 32-bit interface and should be
used instead of JDK class Character.
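The difference between a 16-bit and a 32-bit property check can be sketched as
follows. The sketch uses the int-based overloads of the JDK class Character
only so that it runs without ICU4J; the same loop shape would apply with a
32-bit API such as UCharacter. The class and method names are hypothetical.

```java
// Hypothetical helper: checks character properties per code point, not per char.
final class CodePointProperties {
    private CodePointProperties() {}

    /** Counts letters, treating each surrogate pair as one 32-bit character. */
    static int countLetters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // 32-bit value, joins surrogate pairs
            if (Character.isLetter(cp)) {
                count++;
            }
            i += Character.charCount(cp);   // advance by 1 or 2 code units
        }
        return count;
    }

    /** The 16-bit variant, shown for comparison: it sees surrogates one by one. */
    static int countLettersPerCodeUnit(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }
}
```

For a string that contains a supplementary letter such as U+10400, the 16-bit
variant does not count that letter, because neither half of the surrogate pair
is a letter on its own.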
class String/StringBuffer: extract single characters from string
Example:
char String.charAt( int index )
Requirement:
A 16-bit return value is problematic if the 16-bit value is part of a surrogate
pair or part of a combining character sequence.
Solution Approach (depending on the programming context):
- continue working on 16-bit code units (the following operations do not
destroy surrogate pairs or composite character sequences)
- continue working on 16-bit code units, but skip all "complex
characters" (surrogate pairs and composite character sequences)
- work on 32-bit characters (e.g. this is necessary to check character
properties)
- work on Graphemes represented as strings
These alternatives can be offered via static methods and/or via a character
iterator class.
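The different views on a string can be sketched with JDK means alone. The class
StringViews and its method names are hypothetical; java.text.BreakIterator is
used here as one possible way to find Grapheme borders, and other
implementations (for example from ICU4J) may draw some borders differently.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper offering the 32-bit view and the Grapheme view as static methods.
final class StringViews {
    private StringViews() {}

    /** 32-bit view: one entry per code point, surrogate pairs joined. */
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp);
        }
        return out;
    }

    /** Grapheme view: base character plus following combining characters. */
    static List<String> graphemes(String s) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            out.add(s.substring(start, end));
            start = end;
        }
        return out;
    }
}
```

The 16-bit view needs no helper: it is what String.length and String.charAt
already provide.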
class String: searching
Example:
int String.indexOf( char c )
Requirement:
When a matching character or string is found, the character that immediately
follows the match has to be checked as well. If the matching sequence is
immediately followed by a combining character, then it is not a valid match,
because the combining character modifies the last character of the matching
sequence.
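A search that applies this check might be sketched as follows. The class and
method names are hypothetical, and "combining character" is approximated here
by the Unicode general categories Mn, Mc and Me, which is an assumption rather
than a rule taken from this document.

```java
// Hypothetical sketch of a search that rejects matches followed by a combining mark.
final class CombiningAwareSearch {
    private CombiningAwareSearch() {}

    /**
     * Returns the index of the first occurrence of pattern in text that is
     * not directly followed by a combining mark, or -1 if there is none.
     */
    static int indexOfStandalone(String text, String pattern) {
        int i = text.indexOf(pattern);
        while (i >= 0) {
            int end = i + pattern.length();
            if (end >= text.length() || !isCombiningMark(text.codePointAt(end))) {
                return i;               // nothing modifies the last matched character
            }
            i = text.indexOf(pattern, i + 1);   // invalid match, keep looking
        }
        return -1;
    }

    // Assumption: marks of category Mn, Mc or Me count as combining characters.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }
}
```

In a text such as "cafe" + U+0301 + " e", a search for "e" skips the accented
letter and reports the plain "e" at the end.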
class String/StringBuffer: extracting parts of a string
Example:
String String.substring( int beginIndex, int endIndex)
Requirement:
When extracting parts from a string, avoid splitting surrogate pairs and Graphemes.
Solution Approach (depending on the programming context):
- combine extraction with a preceding search operation (e.g. splitBefore,
splitAfter, cutPrefix ...)
- Completely remove from the result any surrogate pairs and Graphemes that
would be split (e.g. when storing strings in a buffer with limited size)
- keep index access if the string has a fixed format
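The second approach, dropping anything that would be split when a string has to
fit into a limited number of code units, could be sketched like this. The names
are hypothetical, and java.text.BreakIterator is assumed as the source of
Grapheme borders.

```java
import java.text.BreakIterator;

// Hypothetical sketch: shorten a string without splitting surrogate pairs or Graphemes.
final class SafeTruncate {
    private SafeTruncate() {}

    /**
     * Returns the longest prefix of s that has at most maxUnits code units
     * and ends on a Grapheme border. A Grapheme that does not fit completely
     * is removed as a whole. maxUnits must not be negative.
     */
    static String truncate(String s, int maxUnits) {
        if (s.length() <= maxUnits) {
            return s;
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        // If the limit is not a border, fall back to the last border before it.
        int end = it.isBoundary(maxUnits) ? maxUnits : it.preceding(maxUnits);
        return s.substring(0, end);
    }
}
```

The result may be shorter than the limit allows, which is the price for never
returning half a character.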
class StringBuffer: modifying parts of a string
Example:
StringBuffer StringBuffer.replace( int beginIndex, int endIndex, String s )
Requirement:
The indices that mark the borders of the operation must not cut surrogate
pairs or Graphemes. In principle, the same approaches can be applied as for
extracting parts from a string.
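Where index access is kept, a wrapper could at least verify the borders before
modifying the buffer. The following is a sketch under assumptions: the class
name is hypothetical, rejecting bad indices with an exception is only one
possible policy, and Grapheme borders are again taken from
java.text.BreakIterator.

```java
import java.text.BreakIterator;

// Hypothetical sketch: replace only if both indices lie on Grapheme borders.
final class SafeReplace {
    private SafeReplace() {}

    /**
     * Works like StringBuffer.replace, but throws IllegalArgumentException
     * if begin or end would cut a surrogate pair or a Grapheme, or if either
     * index lies outside the buffer.
     */
    static StringBuffer replace(StringBuffer buf, int begin, int end, String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(buf.toString());
        if (!it.isBoundary(begin) || !it.isBoundary(end)) {
            throw new IllegalArgumentException(
                "index range " + begin + ".." + end + " cuts a character");
        }
        return buf.replace(begin, end, s);
    }
}
```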
Rules and restrictions on strings that can be processed
- Strings must not contain unpaired surrogates. Unpaired surrogates may be
skipped or replaced with another character at any time.
- In accordance with the W3C character model, strings must be normalized early.
That means that we normalize strings immediately when they are entered by
the user, and that we assume that strings are normalized when we receive
them from other software. We use Normalization Form C (canonical
decomposition followed by canonical composition). Searching, sorting and
identity matching may fail on unnormalized strings.
- Strings must not start with a combining character. Since concatenation
with such strings may result in unnormalized strings, searching, sorting
and identity matching may fail.
- Identifiers shall not contain format characters (e.g. to indicate writing
direction from right to left). Format characters are not ignored when
searching for identifiers.
- In contexts where a character must be quoted because it is used as a
delimiter, this character must also be quoted if it is used as the base
character of a composite character sequence.
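Some of the rules above can be checked or enforced with JDK means. The sketch
below uses java.text.Normalizer for Normalization Form C; the class StringRules
and its method names are hypothetical, and the test for a leading combining
character is approximated by the general categories Mn, Mc and Me.

```java
import java.text.Normalizer;

// Hypothetical helper that checks the string rules listed above.
final class StringRules {
    private StringRules() {}

    /** Normalizes user input to Normalization Form C as early as possible. */
    static String normalizeEarly(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    /** True if s is already in Normalization Form C. */
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    /** True if s contains a high or low surrogate without its partner. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++;            // well-formed pair, skip the low surrogate
                } else {
                    return true;    // high surrogate without a low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;        // low surrogate without a preceding high one
            }
        }
        return false;
    }

    /** True if s starts with a combining mark (assumed: category Mn, Mc or Me). */
    static boolean startsWithCombiningMark(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int type = Character.getType(s.codePointAt(0));
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }
}
```

The concatenation problem can be seen with "e" and U+0301: each string is
normalized on its own, but their concatenation is not, because Form C would
compose it to U+00E9.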
Possible support by check tools
A check tool can detect one of the critical operations and issue a warning.
It may be possible to suppress the warning if the critical operation is used
in a safe context. This can be:
- extracting a part of a string with indices that come from previous search
or string length operations
- searching for a character which does not permit combining characters (e.g.
\n)
- extracting a character as a 16-bit value from a string and comparing this
character to other characters that are neither surrogate pairs nor can be
followed by combining characters