The Charset module supports a wide variety of character sets, and
it is flexible about the character set names it accepts. Character
case is ignored, as are the most common non-alphanumeric
characters appearing in character set names. E.g. "iso-8859-1"
works just as well as "ISO_8859_1". All encodings specified in
RFC 1345 are supported.
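As an illustration of the name handling described above, the following minimal sketch decodes the same bytes using two spellings of one charset name. The helper decode_as() is hypothetical and not part of the module; on Pike 7.8 and earlier the module would be reached as Locale.Charset instead.

```pike
// decode_as() is a hypothetical helper, not part of the Charset module.
string decode_as(string charset_name, string raw)
{
  return Charset.decoder(charset_name)->feed(raw)->drain();
}

int main()
{
  // "cafe" with an e-acute, as ISO-8859-1 bytes (\351 is octal for 0xE9).
  string raw = "caf\351";
  // Both spellings are expected to select the same decoder.
  write("%O\n", decode_as("iso-8859-1", raw) == decode_as("ISO_8859_1", raw));
  return 0;
}
```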
First of all, the Charset module can handle the following encodings of Unicode:
UTF encodings
Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the wanted codec:
These may be prefixed with "cp", "ibm" or "ms".
These may be prefixed with "cp", "ibm", "ms" or "windows".
The default charset in MySQL, similar to cp1252.
+359 more.
In Pike 7.8 and earlier this module was named Locale.Charset.
void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
Throws a DecodeError exception. See DecodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).
Decoder decoder(string|zero name)
Returns a charset decoder object.
name
The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
If the requested name is not supported, an error is thrown.
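Because decoder throws for names it does not know, callers that accept charset names from outside may want to probe first. The sketch below wraps the call in catch; charset_supported() is a hypothetical helper, and the second name in main() is a made-up string assumed not to be a real charset.

```pike
// charset_supported() is a hypothetical helper, not part of the Charset module.
int(0..1) charset_supported(string name)
{
  // catch yields the thrown error value, or 0 if nothing was thrown.
  mixed err = catch { Charset.decoder(name); };
  return !err;
}

int main()
{
  write("%d\n", charset_supported("utf-8"));
  // Assumed to be an unknown name, used only for illustration.
  write("%d\n", charset_supported("no-such-charset-xyz"));
  return 0;
}
```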
Decoder decoder_from_mib(int mib)
Returns a decoder for the encoding scheme denoted by MIB mib.
void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
Throws an EncodeError exception. See EncodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).
Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)
Returns a charset encoder object.
name
The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
replacement
The string to use for characters that cannot be represented in the charset. It is used when repcb is not given or when it returns zero. If no replacement string is given, an error is thrown instead.
repcb
A function to call for every character that cannot be represented in the charset. If specified, it is called with one argument: a string containing the character in question. If it returns a string, that string replaces the character in the output. If it returns anything else, the replacement argument decides what happens.
If the requested name is not supported, an error is thrown.
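The interplay between replacement and repcb can be sketched as follows. The callback transliterates one character it knows about and returns 0 for everything else, so that the replacement string is used as the fallback. The names transliterate() and to_latin1() are hypothetical, and the choice of "EUR" and "?" is only for illustration.

```pike
// Hypothetical callback: handles the euro sign, declines everything else.
string|int transliterate(string ch)
{
  if (ch[0] == 0x20ac) return "EUR";
  return 0;  // Not a string, so the replacement argument is used instead.
}

// Hypothetical helper: encode to ISO-8859-1 with "?" as the fallback.
string to_latin1(string text)
{
  return Charset.encoder("iso-8859-1", "?", transliterate)->feed(text)->drain();
}

int main()
{
  string euro = sprintf("%c", 0x20ac);
  string cjk = sprintf("%c", 0x4e2d);
  write("%O\n", to_latin1("5 " + euro));
  write("%O\n", to_latin1("x" + cjk));
  return 0;
}
```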
Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)
Returns an encoder for the encoding scheme denoted by MIB mib.
string|zero normalize(string|zero in)
All character set names are normalized through this function before being compared.
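A small sketch of how normalize might be used to test whether two spellings refer to the same name. The helper same_charset_name() is hypothetical; the exact form of the normalized string is not assumed here, only that equal names normalize to equal strings.

```pike
// same_charset_name() is a hypothetical helper, not part of the Charset module.
int(0..1) same_charset_name(string a, string b)
{
  return Charset.normalize(a) == Charset.normalize(b);
}

int main()
{
  write("%d\n", same_charset_name("iso-8859-1", "ISO_8859_1"));
  write("%d\n", same_charset_name("iso-8859-1", "utf-8"));
  return 0;
}
```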
void set_decoder(string name, program decoder)
Adds a custom defined character set decoder. The name is normalized through the use of normalize.
void set_encoder(string name, program encoder)
Adds a custom defined character set encoder. The name is normalized through the use of normalize.
Base class for errors thrown by the Charset module.
inherit Error.Generic : Generic
Error thrown when decode fails (and no replacement char or replacement callback has been registered).
This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
inherit CharsetGenericError : CharsetGenericError
string Charset.DecodeError.charset
The decoding charset, typically as known to Charset.decoder.
Other code may produce errors of this type. In that case this name is something that Charset.decoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.decoder never accepts that name in the future (unless it is extended to implement exactly the same charset).
int Charset.DecodeError.err_pos
The failing position in err_str.
string Charset.DecodeError.err_str
The string that failed to be decoded.
Virtual base class for charset decoders.
string win1252_to_string(string data)
{
  return Charset.decoder("windows-1252")->feed(data)->drain();
}
string Charset.Decoder.charset
Name of the charset - giving this name to decoder returns an instance of the same class as this object.
This is not necessarily the same name that was actually given to decoder to produce this object.
this_program clear()
Clear buffers, and reset all state.
Returns the current object to allow for chaining of calls.
string drain()
Get the decoded data, and reset buffers.
Returns the decoded string.
this_program feed(string s)
Feeds a string to the decoder.
s
String to be decoded.
Returns the current object, to allow for chaining of calls.
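Since feed accumulates input and drain returns what has been decoded so far, input that arrives in pieces can be fed chunk by chunk and drained once at the end. A minimal sketch, where decode_chunks() is a hypothetical helper and the chunk contents are made up for illustration:

```pike
// decode_chunks() is a hypothetical helper, not part of the Charset module.
string decode_chunks(array(string) chunks, string charset_name)
{
  Charset.Decoder dec = Charset.decoder(charset_name);
  foreach (chunks, string chunk)
    dec->feed(chunk);   // feed() returns the decoder, so calls could also be chained.
  return dec->drain();  // Collect the decoded data and reset the buffers.
}

int main()
{
  // \200 is octal for byte 0x80, the euro sign in windows-1252.
  string decoded = decode_chunks(({ "1", "\200" }), "windows-1252");
  write("%O\n", decoded);
  return 0;
}
```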
Error thrown when encode fails (and no replacement char or replacement callback has been registered).
This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
inherit CharsetGenericError : CharsetGenericError
string Charset.EncodeError.charset
The encoding charset, typically as known to Charset.encoder.
Other code may produce errors of this type. In that case this name is something that Charset.encoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.encoder never accepts that name in the future (unless it is extended to implement exactly the same charset).
int Charset.EncodeError.err_pos
The failing position in err_str.
string Charset.EncodeError.err_str
The string that failed to be encoded.
Virtual base class for charset encoders.
inherit Decoder : Decoder
An encoder only differs from a decoder in that it has an extra function.
string Charset.Encoder.charset
Name of the charset - giving this name to encoder returns an instance of the same class as this one.
This is not necessarily the same name that was actually given to encoder to produce this object.
this_program set_replacement_callback(function(string:string) rc)
Change the replacement callback function.
rc
Function that is called to encode characters outside the current character encoding.
Returns the current object to allow for chaining of calls.
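As a sketch of changing the callback on an existing encoder, the example below installs a callback that writes unrepresentable characters as numeric character references. The helper encode_with_numeric_refs() and the "&#NNN;" output form are assumptions chosen for illustration, not something the module prescribes.

```pike
// encode_with_numeric_refs() is a hypothetical helper.
string encode_with_numeric_refs(string text)
{
  Charset.Encoder enc = Charset.encoder("iso-8859-1");
  // The callback receives a one-character string and returns its substitute.
  enc->set_replacement_callback(lambda(string ch) {
    return sprintf("&#%d;", ch[0]);
  });
  return enc->feed(text)->drain();
}

int main()
{
  write("%O\n", encode_with_numeric_refs("price: " + sprintf("%c", 0x20ac)));
  return 0;
}
```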
Codec for the ISO-8859-1 character encoding.