$ vi josh.txt
What I saw in vi is,
Josh
Anonymous
~
~
~
OK, let's grep something ...
$ grep "Josh" josh.txt
$ echo $?
1
Shouldn't I have seen a match and an exit code of 0 instead? I didn't have a clue until I ran strace,
$ strace grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\377\376 \0 \0J\0o\0s\0h\0 \0\r\0\n\0 \0 \0A\0n\0o\0n\0"..., 98304) = 42
read(3, "", 98304) = 0
close(3) = 0
...
$
Good, I saw 'J', 'o', ..., but what are these '\377', '\376', ...? Instead of converting the octal escapes to hexadecimal by hand, I let strace do it for me with the -x option, which prints non-ASCII strings in hexadecimal,
$ strace -x grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\xff\xfe\x20\x00\x20\x00\x4a\x00\x6f\x00\x73\x00\x68\x00\x20\x00\x0d\x00\x0a\x00\x20\x00\x20\x00\x41\x00\x6e\x00\x6f\x00\x6e\x00"..., 98304) = 42
read(3, "", 98304) = 0
close(3) = 0
...
$
Huh? No characters? What are these "\xff\xfe\x20\x00..."? How about cat?
$ cat josh.txt
J o s h
A n o n y m o u s
$
At this point I realized that the character encoding is neither ASCII nor UTF-8. It must be something else, and the leading bytes \xff\xfe are a "byte order mark (BOM)". The Windows API documentation has a page with the following table,
Byte order mark | Description
---|---
EF BB BF | UTF-8
FF FE | UTF-16, little endian
FE FF | UTF-16, big endian
FF FE 00 00 | UTF-32, little endian
00 00 FE FF | UTF-32, big endian
Note: A byte order mark is not a control character that selects the byte order of the text.
It turns out the text file is encoded in "UTF-16, little endian". Just for fun, I ran file,
$ file josh.txt
josh.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
$
That's it! I downloaded this file through Webex on the Windows host, and Webex must have written it in "UTF-16, little endian", the encoding Windows uses natively for Unicode text.
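The BOM table above can also be checked by hand without file. Here is a minimal sketch, assuming a POSIX-like shell with head, od, and tr available; the sample file name is made up for illustration, and the UTF-32 little-endian pattern has to be tested before the UTF-16 one because both start with FF FE.

```shell
# Create a tiny UTF-16LE sample with a BOM (hypothetical file name).
printf '\377\376J\0o\0s\0h\0\r\0\n\0' > sample-utf16le.txt

# Dump the first four bytes as one hex string, e.g. "fffe4a00".
bom=$(head -c 4 sample-utf16le.txt | od -An -tx1 | tr -d ' \n')

# Match against the BOM table; longer patterns come first.
case "$bom" in
  efbbbf*)  enc="UTF-8" ;;
  fffe0000) enc="UTF-32, little endian" ;;
  0000feff) enc="UTF-32, big endian" ;;
  fffe*)    enc="UTF-16, little endian" ;;
  feff*)    enc="UTF-16, big endian" ;;
  *)        enc="no BOM found" ;;
esac

echo "$enc"   # prints: UTF-16, little endian
```

A file without a BOM falls through to the last branch, so this only complements file, which also inspects the content itself.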
How do I grep this file? There may be many other methods, but I just use the iconv command to convert the encoding from utf-16 to utf-8, and then pipe the output to grep, like,
$ iconv -f utf-16le -t utf-8 josh.txt | grep "Josh"
Josh
$ echo $?
0
$
Problem solved!
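If this comes up often, the pipeline can be wrapped in a small function. This is only a sketch: u16grep and the sample file name are hypothetical, and it assumes an iconv that, given plain UTF-16 as the source encoding, reads the byte order from the BOM and drops it (as far as I know, glibc and GNU libiconv both do). The tr step removes the carriage returns left over from the CRLF line terminators.

```shell
# Hypothetical helper: grep a UTF-16 file that starts with a BOM.
u16grep() {
  pattern=$1
  file=$2
  # "-f UTF-16" (no LE/BE suffix) lets iconv take the byte order from the BOM.
  # tr strips the \r of each CRLF so matched lines end cleanly.
  iconv -f UTF-16 -t UTF-8 "$file" | tr -d '\r' | grep -- "$pattern"
}

# Usage with a made-up sample file.
printf '\377\376J\0o\0s\0h\0\r\0\n\0A\0n\0o\0n\0\r\0\n\0' > sample-utf16le.txt
u16grep "Josh" sample-utf16le.txt   # prints: Josh
```

The exit code of the function is that of grep, so echo $? after a match still shows 0.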