for loop - Perl questions regarding unpack() and the v flag in printf()
I am trying to accomplish the following:
For an arbitrary Perl string (whether or not it is internally encoded in UTF-8, and whether or not it has the UTF-8 flag set), scan the string from left to right, and for every character, print the Unicode code point of that character in hex format. To make myself absolutely clear: I do not want to print UTF-8 byte sequences or something; I just want to print the Unicode code point for every character in the string.
At first, I have come up with the following solution:

    #!/usr/bin/perl -w

    use warnings;
    use utf8;
    use feature 'unicode_strings';

    binmode(STDOUT, ':encoding(utf-8)');
    binmode(STDIN, ':encoding(utf-8)');
    binmode(STDERR, ':encoding(utf-8)');

    $text = "\x{3b1}\x{3c9}";

    print $text."\n";
    printf "%vx\n", $text;

    # Prints the following to the console (the console is UTF8):
    # αω
    # 3b1.3c9

Then I have seen examples which, without reasonable explanations, made me doubt that my solution is correct, and I have got questions regarding my own solution as well as the examples.
1) Perl's documentation on the v flag in (...)printf says:
"This flag tells Perl to interpret the supplied string as a vector of integers, one for each character in the string. [...]"
It does not say what exactly it means by "a vector of integers", though. When looking at the output of my example, it seems that those integers are the Unicode code points, but I would like to have that confirmed by somebody who knows for sure.
Hence the question:
1) Can we be sure that every integer pulled from the string that way is the respective character's Unicode code point (and not some other byte sequence)?
Secondly, regarding an example I have found (slightly modified; I can't remember where I got it from, maybe from the Perl docs):

    #!/usr/bin/perl -w

    use warnings;
    use utf8;
    use feature 'unicode_strings';

    binmode(STDOUT, ':encoding(utf-8)');
    binmode(STDIN, ':encoding(utf-8)');
    binmode(STDERR, ':encoding(utf-8)');

    $text = "\x{3b1}\x{3c9}";

    print $text."\n";
    printf "%vx\n", $text for unpack('C0A*', $text);

    # Prints the following to the console (the console is UTF8):
    # αω
    # 3b1.3c9

Being a C and assembly guy, I just don't get why somebody would write the printf statement as shown in the example. According to my understanding, the respective line is syntactically equivalent to:

    for $_ (unpack('C0A*', $text)) {
        printf "%vx\n", $text;
    }

As far as I have understood, unpack() takes $text, unpacks it (whatever this means in detail) and returns a list which in this case has one element, namely the unpacked string. Then $_ runs through that list with one element (without being used anywhere), and hence the block (i.e. the printf()) is executed once. In summary, the only action done by the above snippet is executing printf "%vx\n", $text; one time.
Hence the question:
2) What is the reason for wrapping this into a for loop as shown in the example?
Final questions:
3) If the answer to question 1) is "yes", why do all the examples I have seen use unpack() after all?
4) In the three-line for snippet above, the parentheses which surround unpack() are necessary (leaving them away leads to syntax errors). In contrast, in the example, unpack() does not need to be enclosed in parentheses (but it does no harm if they are added nevertheless). Could somebody explain the reason?
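To make the two forms from question 4) concrete, here is a minimal sketch (not taken from the original example; the variable names @out and $item are made up for illustration). In the statement-modifier form, for is followed by a plain list expression. In the block form, the parentheses belong to the loop syntax itself, which is why they cannot be dropped there.

```perl
use strict;
use warnings;

my $text = "\x{3b1}\x{3c9}";
my @out;

# Statement-modifier form: "EXPR for LIST".
# The list after "for" needs no parentheses of its own.
push @out, sprintf("%vx", $_) for unpack 'C0A*', $text;

# Block form: "for VAR (LIST) BLOCK".
# Here the parentheses are part of the loop syntax and are mandatory.
for my $item (unpack 'C0A*', $text) {
    push @out, sprintf("%vx", $item);
}

# Both loops ran once and collected the same formatted string.
print "$_\n" for @out;
```

Both loops iterate over the same one-element list returned by unpack, so @out ends up holding two identical entries.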
EDIT / UPDATE in reply to ikegami's answer below:
Of course, I know that strings are sequences of integers. But
a) There are many different encodings for these integers, and the bytes in the string's memory area depend on the encoding, i.e. if I have two strings which contain the same character sequence, but I store them in memory using different encodings, the byte sequences at the strings' memory locations are different.
b) I suppose that (besides Unicode) there are many other systems / standards which map characters to integers / code points. For example, the Unicode code point 0x3b1 is the Greek letter α, but in some other system, it may be the German letter Ö.
Under these circumstances, the question makes perfect sense IMHO, but possibly I should be more precise and reword it:
If I have a string $text which only contains characters that are Unicode code points, and if I then execute printf "%vx\n", $text; will it print the Unicode code point in hex for every character under all circumstances, notably (but not limited to):
- regardless of Perl's actual internal encoding of the string
- regardless of the string's UTF-8 flag
- whether or not use feature 'unicode_strings' is active
If the answer is yes, what sense do all the examples make which are using unpack(), notably the example above? By the way, I have now remembered where I got that one from: it is in its original form in Perl's pack() documentation, in the section about C0 and U0 mode. Since they are using unpack(), there must be a reason for doing so.
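For reference, the following sketch contrasts the two modes that section of the pack() documentation is about. It assumes the behaviour described there: C0 selects character mode (the default) and U0 selects UTF-8 byte mode. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "\x{3b1}\x{3c9}";   # two characters: alpha, omega

# C0: character mode. A* hands back the characters of the string.
my ($chars) = unpack 'C0A*', $text;

# U0: UTF-8 mode. A* hands back one element per byte of the
# string's UTF-8 encoding.
my ($bytes) = unpack 'U0A*', $text;

my $c0 = sprintf '%vx', $chars;
my $u0 = sprintf '%vx', $bytes;

print "C0A*: $c0\n";
print "U0A*: $u0\n";
```

The pack() documentation lists 3B1.3C9 for the C0 form and CE.B1.CF.89 for the U0 form (it uses %vX, hence the uppercase). Read this way, the documentation uses unpack() to demonstrate the mode switch, and the %vX format is only the means of displaying the result.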
EDIT / UPDATE No. 2
I have done further research. The following proves that the UTF8 flag plays an important role:

    use Encode;
    use Devel::Peek;

    $text = "\x{3b1}\x{3c9}";
    Dump $text;
    printf("\nsprintf: %vx\n", $text);
    print("utf8 flag: ".((Encode::is_utf8($text)) ? "true" : "false")."\n\n");

    # Clear the flag without touching the underlying bytes.
    Encode::_utf8_off($text);
    Dump $text;
    printf "\nsprintf: %vx\n", $text;
    print("utf8 flag: ".((Encode::is_utf8($text)) ? "true" : "false")."\n\n");

    # Prints the following lines:
    #
    # SV = PV(0x1750c20) at 0x1770530
    #   REFCNT = 1
    #   FLAGS = (POK,pPOK,UTF8)
    #   PV = 0x17696b0 "\316\261\317\211"\0 [UTF8 "\x{3b1}\x{3c9}"]
    #   CUR = 4
    #   LEN = 16
    #
    # sprintf: 3b1.3c9
    # utf8 flag: true
    #
    # SV = PV(0x1750c20) at 0x1770530
    #   REFCNT = 1
    #   FLAGS = (POK,pPOK)
    #   PV = 0x17696b0 "\316\261\317\211"\0
    #   CUR = 4
    #   LEN = 16
    #
    # sprintf: ce.b1.cf.89
    # utf8 flag: false

We can see that _utf8_off indeed removes the UTF8 flag, but leaves the string's bytes untouched. sprintf() with the v flag outputs different results, solely dependent on the string's UTF8 flag, even if the string's bytes remain the same.
sprintf '%vx' has no knowledge of code points or UTF-8. It just returns the string representation of the characters of the string. In other words,
sprintf('%vx', $s)
is equivalent to
join('.', map { sprintf('%x', ord($_)) } split(//, $s))
That means it will output s[0], s[1], s[2], ..., s[length(s)-1], in hex, separated by dots.
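As a quick check of that equivalence, the following sketch compares the built-in format with the hand-written expression for a few sample strings. The helper name vx_by_hand and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: spell out what '%vx' does, one ord() per character.
sub vx_by_hand {
    my ($s) = @_;
    return join('.', map { sprintf('%x', ord($_)) } split(//, $s));
}

# ASCII, Latin-1 range, Greek, and a code point above U+FFFF.
for my $s ("abc", "\xc9ric", "\x{3b1}\x{3c9}", "\x{1f600}") {
    my $builtin = sprintf('%vx', $s);
    my $by_hand = vx_by_hand($s);
    my $verdict = $builtin eq $by_hand ? "same" : "DIFFERENT";
    print "$builtin | $by_hand | $verdict\n";
}
```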
It returns the characters (integers) of the string regardless of the state of the UTF8 flag. This means that how the string is stored (e.g. whether the UTF8 flag is set or not) has no effect on the output.

    use Encode;

    $text1 = "\xc9ric";
    utf8::downgrade($text1);   # store as one byte per character, flag off
    printf("Text1 is a string of %1\$d characters (a vector of %1\$d integers)\n", length($text1));
    print("utf8 flag: ".((Encode::is_utf8($text1)) ? "true" : "false")."\n");
    printf("sprintf: %vx\n\n", $text1);

    $text2 = $text1;
    utf8::upgrade($text2);     # same string, stored as UTF-8, flag on
    print($text1 eq $text2 ? "Text2 is identical to Text1\n\n" : "Text2 differs from Text1\n\n");
    printf("Text2 is a string of %1\$d characters (a vector of %1\$d integers)\n", length($text2));
    print("utf8 flag: ".((Encode::is_utf8($text2)) ? "true" : "false")."\n");
    printf "sprintf: %vx\n\n", $text2;

Output:

    Text1 is a string of 4 characters (a vector of 4 integers)
    utf8 flag: false
    sprintf: c9.72.69.63

    Text2 is identical to Text1

    Text2 is a string of 4 characters (a vector of 4 integers)
    utf8 flag: true
    sprintf: c9.72.69.63

Let's change the code in the question to show the relevant information:

    use Encode;

    $text1 = "\x{3b1}\x{3c9}";
    printf("Text1 is a string of %1\$d characters (a vector of %1\$d integers)\n", length($text1));
    printf("sprintf: %vx\n\n", $text1);

    $text2 = $text1;
    Encode::_utf8_off($text2);   # this changes the string, not just its storage
    print($text1 eq $text2 ? "Text2 is identical to Text1\n\n" : "Text2 differs from Text1\n\n");
    printf("Text2 is a string of %1\$d characters (a vector of %1\$d integers)\n", length($text2));
    printf "sprintf: %vx\n\n", $text2;

Output:

    Text1 is a string of 2 characters (a vector of 2 integers)
    sprintf: 3b1.3c9

    Text2 differs from Text1

    Text2 is a string of 4 characters (a vector of 4 integers)
    sprintf: ce.b1.cf.89

It shows that sprintf '%vx' gives different output for different strings, which is no surprise, since sprintf '%vx' simply outputs the characters of the string. You could just as easily have used uc instead of _utf8_off.
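To illustrate the uc comparison, here is a minimal sketch with made-up variable names. uc produces a different string, just as _utf8_off did, and sprintf '%vx' reports the characters of whatever string it is handed.

```perl
use strict;
use warnings;

my $text1 = "\x{3b1}\x{3c9}";   # small alpha, small omega
my $text2 = uc $text1;          # a different string: capital alpha, capital omega

my $vx1 = sprintf '%vx', $text1;
my $vx2 = sprintf '%vx', $text2;

print "text1: $vx1\n";
print "text2: $vx2\n";
print $text1 eq $text2 ? "identical\n" : "different\n";
```

The two strings compare as different, so differing output from sprintf '%vx' says nothing about the UTF8 flag here.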
- If, for two identical strings, sprintf '%vx' altered its output based on the UTF8 flag, it would be considered to suffer from The Unicode Bug. Most instances of it have been fixed (though sprintf never suffered from this bug).
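For contrast, the sketch below shows an operation that does exhibit The Unicode Bug, namely uc on a character in the 0x80..0xFF range. It assumes Perl 5.12 or later (where the unicode_strings feature exists) and no locale in effect; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $down = "\xe9";  utf8::downgrade($down);   # stored as a single byte, flag off
my $up   = "\xe9";  utf8::upgrade($up);       # stored as UTF-8, flag on

# Same string in both cases, and sprintf '%vx' agrees.
my $same    = $down eq $up;
my $vx_down = sprintf '%vx', $down;
my $vx_up   = sprintf '%vx', $up;

my ($plain_down, $plain_up, $fixed_down, $fixed_up);
{
    no feature 'unicode_strings';   # legacy behaviour: result depends on storage
    $plain_down = sprintf '%vx', uc $down;
    $plain_up   = sprintf '%vx', uc $up;
}
{
    use feature 'unicode_strings';  # storage no longer matters
    $fixed_down = sprintf '%vx', uc $down;
    $fixed_up   = sprintf '%vx', uc $up;
}

print "strings equal: ", ($same ? "yes" : "no"), "\n";
print "sprintf: $vx_down vs $vx_up\n";
print "uc without unicode_strings: $plain_down vs $plain_up\n";
print "uc with unicode_strings:    $fixed_down vs $fixed_up\n";
```

Without the feature, uc leaves the downgraded copy alone but uppercases the upgraded one, even though the two strings are equal. With the feature, both results agree. sprintf '%vx' gives the same output for both copies either way.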