Quantcast
Channel: 2,000 Things You Should Know About C# » Surrogate Pairs
Viewing all articles
Browse latest Browse all 2

#1,007 – Getting Length of String that Contains Surrogate Pairs

$
0
0

You can use the string.Length property to get the length (number of characters) of a string.  This only works, however, for Unicode code points that are no larger than U+FFFF.  This set of code points is known as the Basic Multilingual Plane (BMP).

Unicode code points outside of the BMP are represented in UTF-16 using 4 byte surrogate pairs, rather than using 2 bytes.

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).

            // 3 Latin (ASCII) characters
            string simple = "abc";

            // 3 character string where one character
            //  is a surrogate pair
            string containsSurrogatePair = "A𠈓C";

            // Length=3 (correct)
            Console.WriteLine(string.Format("Length 1 = {0}", simple.Length));

            // Length=4 (not quite correct)
            Console.WriteLine(string.Format("Length 2 = {0}", containsSurrogatePair.Length));

            // Better, reports Length=3
            StringInfo si = new StringInfo(containsSurrogatePair);
            Console.WriteLine(string.Format("Length 3 = {0}", si.LengthInTextElements));

1007-001


Filed under: Strings Tagged: C#, Length, Strings, Surrogate Pairs, Unicode, UTF-16

Viewing all articles
Browse latest Browse all 2

Latest Images

Trending Articles





Latest Images