Symbian Developer Library

SYMBIAN OS V6.1 EDITION FOR C++




Folding and collation (comparing strings)

There are two techniques that may be used to modify the characters in a descriptor prior to performing operations such as comparisons on text strings:


Folding

Folding is a relatively simple way of normalising text for comparison by removing case distinctions, converting accented characters to unaccented characters, and so on. Folding is used for tolerant comparisons, that is, comparisons that are biased towards a match.

Variants of member functions that fold are provided where appropriate. For example, TDesC16::CompareF() performs a folded comparison.
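The idea of a folded comparison can be sketched in portable standard C++. This is only a toy model and not the Symbian implementation: the names FoldChar and CompareFolded are hypothetical, and the folding table covers ASCII letters plus a handful of Latin-1 accented letters, whereas real folding data is far larger.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Symbian API): folds a single character.
// Assumption: only ASCII letters and a few Latin-1 accented letters are handled.
char32_t FoldChar(char32_t c)
{
    if (c >= U'A' && c <= U'Z')
        return c - U'A' + U'a';              // remove the case distinction
    switch (c)
    {
    case 0x00C9: case 0x00E9: return U'e';   // E acute, e acute
    case 0x00C0: case 0x00E0: return U'a';   // A grave, a grave
    case 0x00DC: case 0x00FC: return U'u';   // U diaeresis, u diaeresis
    default:                  return c;      // everything else is left alone
    }
}

// Hypothetical helper: compares two strings after folding each character.
// Returns a negative value, zero or a positive value, in the style of the
// descriptor comparison functions.
int CompareFolded(const std::u32string& a, const std::u32string& b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        const char32_t fa = FoldChar(a[i]);
        const char32_t fb = FoldChar(b[i]);
        if (fa != fb)
            return fa < fb ? -1 : 1;
    }
    if (a.size() == b.size())
        return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

In this sketch, an accented, mixed-case word and its unaccented upper-case form compare as equal, which is the kind of tolerant match that folding is intended to give. In Symbian code the corresponding call would be CompareF() on the descriptor itself.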



Collation

Collation is a more powerful way to compare strings than folding, and produces a dictionary-like ('lexicographic') ordering. Folding cannot remove piece accents or deal with correspondences that are not one-to-one, such as the mapping from German upper case SS to lower case ß. In addition, folding cannot optionally ignore punctuation.

For languages using the Latin script, for example, collation is about deciding whether to ignore punctuation, whether to fold upper and lower case, how to treat accents, and so on. In a given locale there is usually a standard set of collation rules that can be used.

Variants of member functions that use collation are provided where appropriate. For example, TDesC16::CompareC() performs a collated comparison.


Comparing and sorting strings

The TDesC16::CompareC() variant prototyped as:

TInt CompareC(const TDesC16& aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;

returns 0 if the two strings match.

There are many ways in which two strings can match, even when they do not have the same length:

The collation level is an integer that can take one of the values 0, 1, 2 or 3, and determines how strict the matching of two strings should be. This value is passed as the second parameter to CompareC(). The values have the following meanings:

At levels 0-2:

At level 3 these are treated differently.

If the aim is to sort strings, then level 3 must be used. For any strings a and b, if a < b at some level of collation, then a < b at all higher levels of collation as well. Using a collation level lower than 3 therefore cannot change the relative order of strings that already differ at that level. It only causes strings that differ solely at the higher levels to compare as equal, so that they sort in an arbitrary order. In standard English, sorting at level 3 gives the following order:

bat < bee < BEE < bug

The case of the B only affects the comparison after all the letter identities have been found to be the same, which is usually the effect people are trying to achieve by sorting at a collation level lower than 3. Lowering the level for this purpose is never necessary.
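The way a higher level acts only as a tie-breaker can be illustrated with a small two-level model in portable standard C++. This is a sketch under stated assumptions, not the Symbian collation code: CompareAtLevel and SortAtLevel are hypothetical names, only ASCII text is handled, and the two toy levels (0 for letter identity, 1 for letter identity plus case) are not meant to correspond to particular values of aMaxLevel.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy comparison (not a Symbian API), ASCII only.
// Toy level 0: compare letter identities, ignoring case.
// Toy level 1: as level 0, then use case as a tie-breaker (lower case first).
int CompareAtLevel(const std::string& a, const std::string& b, int maxLevel)
{
    assert(maxLevel == 0 || maxLevel == 1);

    // First pass: letter identities decide the result whenever they differ.
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i)
    {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    if (a.size() != b.size())
        return a.size() < b.size() ? -1 : 1;
    if (maxLevel == 0)
        return 0;                       // identical at the lower level

    // Second pass: reached only when all letter identities are the same.
    for (std::size_t i = 0; i < n; ++i)
    {
        const bool ua = std::isupper(static_cast<unsigned char>(a[i])) != 0;
        const bool ub = std::isupper(static_cast<unsigned char>(b[i])) != 0;
        if (ua != ub)
            return ua ? 1 : -1;
    }
    return 0;
}

// Hypothetical helper: returns the words sorted at the given toy level.
std::vector<std::string> SortAtLevel(std::vector<std::string> words, int maxLevel)
{
    std::stable_sort(words.begin(), words.end(),
        [maxLevel](const std::string& x, const std::string& y)
        { return CompareAtLevel(x, y, maxLevel) < 0; });
    return words;
}
```

Sorting bug, BEE, bat and bee at the higher toy level gives bat, bee, BEE, bug. At the lower toy level, bee and BEE compare as equal, so their relative position depends only on the input order, while bat and bug keep the same places at both levels.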

The sort order can be affected by setting flags in the TCollationMethod object.

Note that when strings match at level 3, they do not necessarily have the same binary representation, or even the same length. Unicode contains many strings that are regarded as equivalent, even though they have different binary representations.
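One familiar case of equivalent strings with different binary representations is an accented letter that can be encoded either as a single precomposed character or as a base letter followed by a combining accent. The following portable C++ sketch illustrates this with a deliberately tiny model: ComposeAcute and EquivalentText are hypothetical names, and only two base letters with the combining acute accent (U+0301) are handled, whereas a real implementation relies on the full Unicode data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Symbian API): replaces "base letter + U+0301"
// with the precomposed letter. Assumption: only e and a are handled.
std::u32string ComposeAcute(const std::u32string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const bool acuteFollows = (i + 1 < s.size() && s[i + 1] == 0x0301);
        if (acuteFollows && s[i] == U'e')
        {
            out.push_back(0x00E9);      // e + combining acute -> U+00E9
            ++i;                        // skip the combining character
        }
        else if (acuteFollows && s[i] == U'a')
        {
            out.push_back(0x00E1);      // a + combining acute -> U+00E1
            ++i;
        }
        else
        {
            out.push_back(s[i]);
        }
    }
    return out;
}

// Hypothetical helper: true if the two strings are the same once both have
// been brought to the precomposed form.
bool EquivalentText(const std::u32string& a, const std::u32string& b)
{
    return ComposeAcute(a) == ComposeAcute(b);
}
```

Here a four-character string ending in U+00E9 and a five-character string ending in e followed by U+0301 differ in both length and binary content, yet the sketch treats them as equivalent text.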