艾歐踢論壇

Posted 2016-9-1 15:40:35
Centura Unicode Support
A common question raised about the Centura client concerns its Unicode capabilities. Unicode is a text encoding standard for the computer industry that defines a large number of characters (over 100,000) across most of the world's languages in one consistent system.
This article attempts to shed some light on the issue and to gather as much of it as possible into a single source of information, so it may at times be a bit overly pedantic. Feel free to skip sections you find obvious.
Character Sets

Character sets are a way of representing human-readable glyphs (characters) in a computer's memory. What we understand as characters like "A", "1" or punctuation like "." are to a machine just binary numbers, because machines store all information internally in binary. There is more than one way to map characters to numbers, so you end up with multiple character encodings. The most common form still in use today is ASCII (American Standard Code for Information Interchange). In ASCII, for example, the character 'A' is decimal 65 (decimal is the numbering system we use day to day; it uses base 10 – a natural base for a species with 10 fingers) or hex (hexadecimal – base 16) 0x41. Each ASCII code (or character) is stored in 1 byte of computer memory. In general that is 8 bits (though the number of bits per byte depends on the computing architecture; we assume Intel's x86 – or IA-32, as they call it now).
The fact that a character is encoded in 8 bits implies that the ASCII character set can only encode a limited number of characters: 2^8 (2 to the power of 8) = 256, because each bit can be either on or off – 2 states per bit, raised to the power of the number of bits (8). 256 characters is enough for the basic English alphabet and numbers, along with punctuation and a few special characters for things like line endings. Operating systems can also map different glyphs onto the ASCII codes, enabling English-like languages with a few special characters to be encoded within the 256-character limit.
Complex scripts like Chinese, Sinhalese or Tamil, however, contain far more characters than will fit into the 256-character limitation. This caused problems early on in computing when trying to support such languages in operating systems and programs. For a time, the gap was filled by a "multi-byte" system of ASCII characters combined with operating-system-level "paging" of glyphs: the OS could support a larger number of characters if some characters were encoded with more than 1 byte, i.e. as 2 ASCII bytes. A complex character could thus be defined as 2 bytes, and the operating system could be switched to the "page" of the character set being rendered (drawn) in order to properly identify and draw the glyph. This method had its limitations, the primary one being that you could not have multiple complex languages running in the OS at the same time, because the OS had to page from a single language for manageability's sake.
Then came Unicode. Unicode is a text encoding standard for the computer industry that defines a large number of characters (over 100,000) across most of the world's languages in one consistent system. Unicode also defines a set of character encodings for storing all those defined characters. When you hear terms like UTF-8 or UTF-16, you're actually hearing how the characters are encoded, not "which Unicode standard".
There's one standard to define them all, and encodings to write them. :)
Unicode in Windows

The first version of Windows was written before the Unicode standard was agreed upon, which meant that all characters were encoded in ASCII, with only limited functionality for multi-byte character sets. Prior to Windows NT 3.1, all versions of Windows used ASCII internally to represent characters; multiple languages and complex scripts were supported via the multi-byte ASCII encoding mentioned earlier. When Unicode came along, Microsoft had a problem on their hands. Many programs were already written against the Win16 API (the set of libraries applications use to get services from the OS). Microsoft could not simply change those APIs to use Unicode characters in one of the new encodings instead of multi-byte characters, because that would break many of the programs already written.
So when Dave Cutler and the gang started working on the Win32 API (the 32-bit successor to Win16) late in NT's implementation, they had a cunning plan. They kept all the old functions that used multi-byte characters and created duplicate functions that used Unicode characters. Then they created C #defines that route API calls to the correct function depending on the character set a programmer wants to use at compile time. It looked a bit like this:
#ifndef UNICODE
#define SetWindowText SetWindowTextA
#else
#define SetWindowText SetWindowTextW
#endif
Here the function with the A suffix is the ASCII (multi-byte) function, and the function with the W suffix is the wide-character, or Unicode, function. At compile time a programmer defines the UNICODE preprocessor variable if desired and calls the generic function SetWindowText; the #define then routes the call to the Unicode version. The implication is that a program compiled for Windows can be either Unicode or multi-byte, but not both.
It should be noted, however, that the internal undocumented functions of the NT core actually use Unicode natively. This means that when you call an "A" function in Windows, the call is translated into Unicode and routed to the "W" function.
There are a number of string type definitions that allow programmers to work with Unicode or multi-byte strings transparently, created with the same kind of mechanism used for the functions. The explanation of these types is out of the scope of this document.
Centura and Unicode

Centura is compiled as an ASCII program; that is, it uses a multi-byte character set and the multi-byte versions of the Win32 functions. This holds for all versions of Centura applications, be it 2003-2, 2004-1, App7 or App75.
The App7 and App75 versions of the Centura client, however, can connect to a Unicode database. This is done through a mechanism provided by Oracle that allows the client to run a different character set than the server: the Oracle Call Interface (OCI) converts Unicode characters into their equivalent multi-byte characters when bringing data over from the server. This means that Centura sees all the data as a multi-byte character set.
Some implications follow:
[1] Within Centura, character glyphs are rendered using the active Windows code page. This is controlled by the "Language for non-Unicode programs" setting under Regional and Language Options in Windows.
[2] A special environment variable called NLS_LANG must be set on the client so that the OCI layer knows which character set the client side uses.
[3] You cannot run multiple languages at once within the Centura client.
[4] Localization support in the Centura client depends on Windows' support for the required multibyte script. This works well for most European languages; Asian languages like Vietnamese have limited support within Centura.
Windows Common Controls and Custom Controls

Controls like data fields and list boxes in Centura are based on the Windows common controls library – which means Windows is responsible for drawing the controls and handling user interaction. This library is available in all versions of Windows and supports both Unicode and multi-byte character sets; Centura uses the multi-byte versions of its functions. In general, localization support within this library is good, with a few notable bugs (see the bugs section). Controls that Centura takes from the common controls library should therefore have relatively good support for most multibyte languages, including complex scripts.
Other controls, like table windows in Centura, are "custom" – in other words, Centura is responsible for both the rendering of the controls and the user interaction. While the localization support for these controls is in general good, it is not perfect; Centura has known issues in table windows with complex Asian languages like Vietnamese.