Character Encodings in Perl
This article describes the different character encodings, how they can lead to bugs, and how they can be handled in Perl programs.
German and French versions are available as well.
Introduction
It happens all too often: a program works fine with Latin characters, but it produces weird, unreadable characters as soon as it has to process other characters, like Chinese or Japanese characters, or modified Latin characters like the German umlauts Ä, Ö etc. or the Scandinavian characters å and Ø.
ASCII
To understand the root of the problem you have to understand how "normal" Latin characters and other characters (the ones that cause problems) are stored.
It all began in the year 1963 with ASCII, the "American Standard Code for Information Interchange". It maps 128 characters to the numbers from 0 to 127, which can be encoded with 7 bits.
Since a byte contains 8 bits, the first, "most significant" bit in ASCII-encoded characters is always zero.
The standard defines the Latin letters a to z in both upper and lower case, the Arabic digits 0 to 9, whitespace like "blank" and "carriage return", a few control characters, and a few special signs like %, $ and so on.
Characters that aren't essential in the day-to-day life of an American citizen are not defined in ASCII, like Cyrillic letters, "decorated" Latin characters, Greek characters and so on.
Other Character Encodings
When people started to use computers in other countries, other characters needed to be encoded. In the European countries ASCII was reused, and the 128 unused numbers per byte were used for the locally needed characters.
In Western Europe the character encoding was called "Latin 1", and was later standardized as ISO-8859-1. Latin 2 was used in Central Europe, and so on.
In each of the Latin-* charsets the first 128 characters are identical to ASCII, so they can be viewed as ASCII extensions. The second 128 byte values are each mapped to characters needed in the regions where these character sets were used.
In other parts of the world other character encodings were developed, like EUC-CN in China and Shift-JIS in Japan.
These local charsets are very limited. When the Euro was introduced in 2002, many European countries had a currency whose symbol couldn't be expressed in the traditional character encodings.
Unicode
The charsets mentioned so far can encode only a small part of all possible characters, which makes it nearly impossible to create documents that contain letters from different scripts.
In an effort to unify all scripts into a single writing system, the Unicode Consortium was created. It started to collect all known characters and to assign a unique number to each one, called its "codepoint".
The codepoint is usually written as a four or six digit hex number, like U+0041. The corresponding name is LATIN CAPITAL LETTER A.
Apart from letters and other "base characters", there are also accents and decorations like COMBINING ACUTE ACCENT, which can be added to a base character.
If a base character is followed by one or more of these marking characters, the compound forms a logical character called a "grapheme".
Note that many pre-composed graphemes exist for characters that are defined in other character sets, and these pre-composed characters are typically better supported by current software than the equivalent written as a base character plus combining marks.
Unicode Transformation Formats
The concept of Unicode codepoints and graphemes is completely independent of the encoding.
There are different ways to encode these codepoints, and these mappings from codepoints to bytes are called "Unicode Transformation Formats". The most well known is UTF-8, a byte-based format that uses all possible byte values from 0 to 255. In Perl land there is also a lax version called utf8 (without the hyphen). The Perl module Encode distinguishes the two.
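As a minimal sketch of the difference (the surrogate codepoint below is chosen only because strict UTF-8 rejects it while lax utf8 passes it through):

    use strict;
    use Encode qw(encode);

    no warnings 'surrogate';     # we create a surrogate on purpose
    my $surrogate = "\x{D800}";  # a UTF-16 surrogate codepoint

    # lax: encodes the codepoint as-is
    my $lax = encode('utf8', $surrogate);

    # strict: substitutes the replacement character U+FFFD by default
    my $strict = encode('UTF-8', $surrogate);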
Windows mostly uses UTF-16, which uses at least 2 bytes per codepoint; for very high codepoints it uses 4 bytes. There are two variants of UTF-16, marked with the suffixes -LE for "little endian" and -BE for "big endian" (see Endianness).
UTF-32 encodes every codepoint in 4 bytes. It is the only fixed-width encoding that can implement the whole Unicode range.
Codepoint | Char | ASCII | UTF-8 | Latin-1 | ISO-8859-15 | UTF-16 |
---|---|---|---|---|---|---|
U+0041 | A | 0x41 | 0x41 | 0x41 | 0x41 | 0x00 0x41 |
U+00C4 | Ä | - | 0xc3 0x84 | 0xc4 | 0xc4 | 0x00 0xc4 |
U+20AC | € | - | 0xe2 0x82 0xac | - | 0xa4 | 0x20 0xac |
U+C218 | 수 | - | 0xec 0x88 0x98 | - | - | 0xc2 0x18 |
(The letter in the last line is the Hangul syllable SU, and your browser will only display it correctly if you have the appropriate Asian fonts installed.)
Unicode defines a character repertoire of codepoints and their properties. Character encodings like UTF-8 and UTF-16 define a way to write them as short sequences of bytes.
Perl 5 and Character Encodings
Perl strings can be used to hold either text strings or binary data. Given a string, you generally have no mechanism of finding out whether it holds text or binary data - you have to keep track of it yourself.
Interaction with the environment (like reading data from STDIN or a file, or printing it) treats strings as binary data. The same holds true for the return values of many built-in functions (like gethostbyname) and for special variables that carry information to your program (%ENV and @ARGV).
Other built-in functions that deal with text (like uc and lc and regular expressions) treat strings as text, or more accurately as a list of codepoints.
With the function decode from the module Encode you decode binary strings, to make sure that the text handling functions work correctly.
All text operations should work on strings that have been decoded by Encode::decode (or in one of the other ways described below). Otherwise the text processing functions assume that the string is stored as Latin-1, which yields incorrect results for any other encoding.
Note that cmp compares non-ASCII characters only by codepoint number, which might give unexpected results. In general the ordering is language dependent, so you need use locale in effect to sort strings according to the rules of a natural language. For example, in German the desired ordering is 'a' lt 'ä' and 'ä' lt 'b', whereas comparison by codepoint number gives 'ä' gt 'b'.
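As an alternative to the locale system, the core module Unicode::Collate (not covered in this article; a minimal sketch follows) sorts by the Unicode Collation Algorithm, which places 'ä' between 'a' and 'b':

    use utf8;
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';
    my @words    = ('b', 'ä', 'a');
    my $collator = Unicode::Collate->new;
    print join(' ', $collator->sort(@words)), "\n";  # a ä b
    print join(' ', sort @words), "\n";              # a b ä (codepoint order)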
The following example shows the wrong and the right way of lower-casing encoded text:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode qw(encode decode);

    my $enc      = 'utf-8';  # encoding of the source file and the terminal
    my $byte_str = "Ä\n";    # a byte string, not yet decoded

    print lc $byte_str;      # wrong: lc sees bytes, not text

    my $text_str = decode($enc, $byte_str);  # decode to a text string
    $text_str = lc $text_str;                # now lc works correctly
    print encode($enc, $text_str);           # encode back for output
It is highly recommended to convert all input to text strings, then work with the text strings, and only convert them back to byte strings on output or storage.
Otherwise you can get confused very quickly, and lose track of which strings are byte strings and which ones are text strings.
Perl offers IO layers, which are easy mechanisms for making these conversions automatically, either globally or per file handle.
    # apply an IO layer when opening:
    open my $handle, '<:encoding(UTF-8)', $file;

    # or set it afterwards:
    open my $handle, '<', $file;
    binmode $handle, ':encoding(UTF-8)';

    # for all open() calls in the current lexical scope:
    use open ':encoding(iso-8859-1)';

    # declare that the source code itself is UTF-8 encoded:
    use utf8;

    # use the encoding of the current locale:
    use PerlIO::locale;
    binmode STDOUT, ':locale';

    # locale encoding for all file handles:
    use open ':locale';
Care should be taken with the input layer :utf8, which often pops up in example code and old documentation: it assumes the input to be valid UTF-8, and you have no way of knowing in your program whether that was actually the case. If it wasn't, it's a source of subtle security holes; see this article on perlmonks.org for details. Don't ever use it as an input layer; use :encoding(UTF-8) instead.
The module and pragma utf8 also allows you to use non-ASCII characters in variable names and module names. But beware: don't do this for package and module names; it might not work well. Also consider that not everybody has a keyboard that allows easy typing of non-ASCII characters, so you make maintenance of your code much harder if you use them.
Testing your Environment
You can use the following short script to test your terminal, locale and fonts. It is very Europe-centric, but you should be able to modify it to use the character encodings that are commonly used where you live.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode;

    my @charsets = qw(utf-8 latin1 iso-8859-15 utf-16);
    my $test = 'Ue: ' . chr(220) . '; Euro: ' . chr(8364) . "\n";

    for (@charsets){
        print "$_: " . encode($_, $test);
    }
If you run this program in a terminal, only one line will be displayed correctly, and its first column is the character encoding of your terminal.
The Euro sign € isn't in Latin-1, so if your terminal uses that encoding, the Euro sign won't be displayed correctly.
Windows terminals mostly use cp* encodings, for example cp850 or cp858 (only available in new versions of Encode) for German Windows installations. The rest of the operating environment uses Windows-* encodings, for example Windows-1252 for a number of Western European localizations. Encode->encodings(":all"); returns a list of all available encodings.
Troubleshooting
"Wide Graphic symbol in print"
Sometimes y'all might see the Wide character in impress
alarm.
This ways that you tried to use decoded string data in a context where information technology only makes sense to have binary data, in this case printing it. You tin make the warning go away by using an appropriate output layer, or by piping the offending string through Encode::encode
first.
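A minimal sketch showing the warning and both ways to silence it:

    use strict;
    use warnings;
    use Encode qw(encode);

    my $str = "\x{20ac}\n";              # a decoded text string (Euro sign)
    # print $str;                        # would warn: Wide character in print

    binmode STDOUT, ':encoding(UTF-8)';  # fix 1: appropriate output layer
    print $str;

    # fix 2: encode explicitly before printing
    # print encode('UTF-8', $str);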
Inspecting Strings
Sometimes you want to check whether a string from an unknown source has already been decoded. Since Perl has no separate data types for binary strings and decoded strings, you can't do that reliably.
But there is a way to guess the answer by using the module Devel::Peek:
    use Devel::Peek;
    use Encode;

    my $str = "ä";
    Dump $str;

    $str = decode("utf-8", $str);
    Dump $str;

    Dump encode('latin1', $str);

    __END__
    SV = PV(0x814fb00) at 0x814f678
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x81654f8 "\303\244"\0
      CUR = 2
      LEN = 4
    SV = PV(0x814fb00) at 0x814f678
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x817fcf8 "\303\244"\0 [UTF8 "\x{e4}"]
      CUR = 2
      LEN = 4
    SV = PV(0x814fb00) at 0x81b7f94
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK)
      PV = 0x8203868 "\344"\0
      CUR = 1
      LEN = 4
The string UTF8 in the line starting with FLAGS = shows that the string has been decoded already. The line starting with PV = holds the bytes, and in brackets the codepoints.
But there is a big caveat: just because the UTF8 flag isn't present doesn't mean that the text string hasn't been decoded. Perl uses either Latin-1 or UTF-8 internally to store strings, and the presence of this flag indicates which one is used.
That also implies that if your program is written in pure Perl (and has no XS components), it is almost certainly an error to rely on the presence or absence of that flag. You shouldn't care how perl stores its strings anyway.
Buggy Modules
A common source of errors is buggy modules. The pragma encoding looks very tempting:

    use encoding ':locale';

But under the effect of use encoding, some AUTOLOAD functions stop working, and the module isn't thread safe.
Character Encodings in the WWW
When you write a CGI script, you have to choose a character encoding, print all your data in that encoding, and declare it in the HTTP headers.
For most applications, UTF-8 is a good choice, since you can encode arbitrary Unicode codepoints with it. On the other hand, English text (and that of most other European languages) is encoded very efficiently.
HTTP offers the Accept-Charset header, in which the client can tell the server which character encodings it can handle. But if you stick to the common encodings like UTF-8 or Latin-1, next to all user agents will understand it, so it isn't really necessary to check that header.
HTTP headers themselves are strictly ASCII only, so all information that is sent in the HTTP headers (including cookies and URLs) needs to be encoded to ASCII if non-ASCII characters are used.
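A minimal sketch of such encoding, here with the CPAN module URI::Escape (my choice for illustration, not prescribed above): the text string is first encoded to UTF-8 bytes, then percent-escaped:

    use utf8;
    use Encode qw(encode);
    use URI::Escape qw(uri_escape);

    my $value   = 'Käse';                               # decoded text string
    my $escaped = uri_escape(encode('UTF-8', $value));  # 'K%C3%A4se', pure ASCII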
For HTML files the header typically looks like this: Content-Type: text/html; charset=UTF-8. If you send such a header, you only have to escape those characters that have a special meaning in HTML: <, >, & and, in attributes, ".
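A minimal hand-rolled escape function as a sketch (the CPAN module HTML::Entities does the same job and more):

    sub escape_html {
        my ($s) = @_;
        $s =~ s/&/&amp;/g;    # must be replaced first
        $s =~ s/</&lt;/g;
        $s =~ s/>/&gt;/g;
        $s =~ s/"/&quot;/g;
        return $s;
    }

    my $safe = escape_html('Paris & <Co> "quoted"');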
Special care must be taken when reading POST or GET parameters with the function param in the module CGI. Older versions (prior to 3.29) always returned byte strings; newer versions return text strings if charset("UTF-8") has been called before, and byte strings otherwise.
CGI.pm also doesn't support character encodings other than UTF-8. Therefore you should not use the charset routine, and instead explicitly decode the parameter strings yourself.
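A sketch of that manual decoding (the parameter name is a placeholder, and the form is assumed to be submitted as UTF-8):

    use strict;
    use warnings;
    use CGI;
    use Encode qw(decode);

    my $cgi  = CGI->new;
    my $name = decode('UTF-8', $cgi->param('name') // '');  # byte string -> text string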
To ensure that form contents in the browser are sent with a known charset, you can add the accept-charset attribute to the <form> tag.

    <form method="post" accept-charset="utf-8" action="/script.pl">
If you use a template system, you should take care to choose one that knows how to handle character encodings. Good examples are Template::Alloy, HTML::Template::Compiled (since version 0.90 with the open_mode option), or Template Toolkit (with the ENCODING option in the constructor and an IO layer in the process method).
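A sketch of the Template Toolkit variant (template name, output file and variables are placeholders):

    use strict;
    use warnings;
    use Template;

    my $tt = Template->new({ ENCODING => 'utf8' })  # decode template files as UTF-8
        or die Template->error;
    open my $out, '>:encoding(UTF-8)', 'output.html' or die $!;
    $tt->process('page.tt', { title => 'Overview' }, $out)
        or die $tt->error;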
Modules
There is a plethora of Perl modules out there that handle text, so here are just a few notable ones, and what you have to do to make them Unicode-aware:
LWP::UserAgent and WWW::Mechanize
Use $response->decoded_content instead of just $response->content. That way the character encoding information sent in the HTTP response headers is used to decode the body of the response.
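A short sketch (the URL is a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get('https://example.com/');
    die $response->status_line unless $response->is_success;
    my $text = $response->decoded_content;  # decoded according to the response headers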
DBI
DBI leaves the handling of character encodings to the DBD:: (driver) modules, so what you have to do depends on which database backend you are using. What most of them have in common is that UTF-8 is better supported than other encodings.
For MySQL and DBD::mysql, pass the mysql_enable_utf8 => 1 option to the DBI->connect call.
For PostgreSQL and DBD::Pg, set the pg_enable_utf8 attribute to 1.
For SQLite and DBD::SQLite, set the sqlite_unicode attribute to 1.
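A sketch of all three connect calls (DSNs and credentials are placeholders):

    use strict;
    use warnings;
    use DBI;

    my ($user, $pass) = ('user', 'secret');

    my $mysql  = DBI->connect('dbi:mysql:database=test', $user, $pass,
                              { mysql_enable_utf8 => 1 });
    my $pg     = DBI->connect('dbi:Pg:dbname=test', $user, $pass,
                              { pg_enable_utf8 => 1 });
    my $sqlite = DBI->connect('dbi:SQLite:dbname=test.sqlite', '', '',
                              { sqlite_unicode => 1 });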
Advanced Topics
With the basic charset and Perl knowledge you can get quite far. For example, you can make a web application "Unicode safe", i.e. you can take care that all possible user inputs are displayed correctly, in whatever script the user happens to use.
But that's not all there is to know on the topic. For example, the Unicode standard allows different ways to compose some characters, so you need to "normalize" them before you can compare two strings. You can read more about that in the Unicode normalization FAQ.
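A minimal sketch with the core module Unicode::Normalize, comparing a pre-composed character with its decomposed equivalent:

    use Unicode::Normalize qw(NFC);

    my $composed   = "\x{e9}";    # é as a single codepoint
    my $decomposed = "e\x{301}";  # e + COMBINING ACUTE ACCENT
    print $composed eq $decomposed ? "equal\n" : "not equal\n";            # not equal
    print NFC($composed) eq NFC($decomposed) ? "equal\n" : "not equal\n";  # equal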
To implement country-specific behaviour in programs, you should take a look at the locales system. For example, in Turkey the lower case of the capital letter I is ı, U+0131 LATIN SMALL LETTER DOTLESS I, while the upper case of i is İ, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
A good place to start reading about locales is perldoc perllocale.
Philosophy
Many programmers who are confronted with encoding issues at first react with "But shouldn't it just work?". Yes, it should just work. But too many systems are broken by design regarding character sets and encodings.
Broken by Design
"Broken by Design" almost of the time means that a document format, and API or a protocol allows multiple encodings, without a normative mode on how that encoding information is transported and stored out of ring.
A classical example is the Net Relay Chat (IRC), which specifies that a character is one Byte, but not which grapheme encoding is used. This worked well in the Latin-ane days, simply was leap to fail equally soon as people from different continents started to employ it.
Currently, many IRC clients attempt to autodetect graphic symbol encodings, and recode it to what the user configured. This works quite well in some cases, but produces actually ugly results where it doesn't work.
Another Example: XML
The Extensible Markup Language, commonly known by its abbreviation XML, lets you specify the character encoding inside the file:

    <?xml version="1.0" encoding="UTF-8"?>

There are two reasons why this is insufficient:
- The encoding information is optional. The specification clearly states that the encoding must be UTF-8 if the encoding information is absent, but sadly many tool authors don't seem to know that, and emit Latin-1 instead. (This is of course only partly the fault of the specification.)
- Any XML parser first has to autodetect the encoding to be able to parse the encoding information.
The second point is really important. You'd guess "Ah, that's no problem, the preamble is just ASCII" - but many encodings are ASCII-incompatible in the first 127 bytes (for example UTF-7, UCS-2 and UTF-16).
So although the encoding information is available, the parser first has to guess roughly correctly in order to extract it.
The appendix to the XML specification contains a detection algorithm that can handle all common cases, but it lacks, for example, UTF-7 support.
How to Do It Right: Out-of-band Signaling
The XML example above demonstrates that a file format can't carry encoding information inside the file itself, unless you specify a way to carry that encoding information on the byte level, independently of the encoding of the rest of the file.
A possible workaround could have been to specify that the first line of any XML file has to be ASCII encoded, and that the rest of the file is in the encoding specified in that first line. But it's an ugly workaround: a normal text editor would display the first line completely wrong if the file is in an ASCII-incompatible encoding. Of course it's also incompatible with the current XML specification, and would require a new, incompatible specification, which would in turn break all existing applications.
So how to do it right, then?
The answer is quite simple: every system that works with text data has to either store metadata separately, or store everything in one common encoding.
It is tempting to store everything in the same encoding, and it works quite well on a local machine, but you can't expect everyone to agree on one single encoding, so all data exchange still has to carry encoding information. And usually you want to store original files (for fear of information loss), so you have to keep that encoding information somewhere.
This observation should have a huge impact on the computing world: all file systems should allow you to store encoding information as metadata and to retrieve it easily. The same should hold true for file names, and programming languages (at least those that want to take the pain away from their users) should transparently transport that metadata and take care of all encoding issues.
Then it could just work.
Further reading
- W3C tutorial on character encodings in HTML and CSS
- Perl Programming/Unicode UTF-8 wikibook
- perlunitut, the Perl Unicode tutorial
- gucharmap, the Gnome Unicode character map
- A UTF-8 dumper that shows you the names of non-ASCII characters
- hexdump never lies (on Debian it's in the bsdmainutils package)
- iconv converts text files from one character encoding to another
Acknowledgments
This article is a translation of a German article of mine, written for $foo-Magazin 01/2008, a German Perl magazine. It has been enhanced and corrected since then.
Many thanks go to Juerd Waalboer, who pointed out many smaller and a few not-so-small errors in earlier versions of this article, and who contributed greatly to my understanding of Perl's string handling.
I'd also like to thank ELISHEVA for suggesting many improvements to both grammar and spelling.
I'd like to acknowledge insightful discussions with:
- ikegami
- Aristotle Pagaltzis
Source: https://perlgeek.de/en/article/encodings-and-unicode