Does "grep" not always work on the Windows Subsystem for Linux (WSL) on Windows 10? I have been investigating a problem that bothered me a great deal. Here is what I saw:
$ vi josh.txt
What I saw in vi was:
Josh
Anonymous
~
~ ~
OK, let's grep something ...
$ grep "Josh" josh.txt
$ echo $?
1
Shouldn't I have seen a match and an exit code of 0 instead?
I didn't have a clue until I ran strace:
$ strace grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\377\376 \0 \0J\0o\0s\0h\0 \0\r\0\n\0 \0 \0A\0n\0o\0n\0"..., 98304) = 42
read(3, "", 98304) = 0
close(3) = 0
...
$
Good, I saw 'J', 'o', ..., but what are these '\377', '\376', ...? Instead of converting the octal numbers to hexadecimal by hand, I let strace do it for me:
$ strace -x grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\xff\xfe\x20\x00\x20\x00\x4a\x00\x6f\x00\x73\x00\x68\x00\x20\x00\x0d\x00\x0a\x00\x20\x00\x20\x00\x41\x00\x6e\x00\x6f\x00\x6e\x00"..., 98304) = 42
read(3, "", 98304) = 0
close(3) = 0
...
$
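For reference, the octal escapes from the first trace can also be converted by hand. This is only a minimal sketch, assuming a POSIX-style shell whose printf reads a number with a leading 0 as octal; the helper name oct2hex is made up for illustration:

```shell
# oct2hex is a hypothetical helper: it prints each octal argument
# as a two-digit hexadecimal number, one per line.
oct2hex() {
    for o in "$@"; do
        # The leading 0 makes printf interpret the digits as octal.
        printf '%02x\n' "0$o"
    done
}

# The two mystery bytes from the first strace run.
oct2hex 377 376
```

So '\377' and '\376' are the bytes ff and fe.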
Huh? No characters? What are these "\xff\xfe\x20\x00..."? How about cat?
$ cat josh.txt
J o s h
A n o n y m o u s
$
At this point, I realized that the character encoding was neither ASCII nor UTF-8 but something else, and that the leading bytes were a "byte order mark" (BOM).
The Windows API documentation has a page with the following table:
Byte order mark | Description
EF BB BF        | UTF-8
FF FE           | UTF-16, little endian
FE FF           | UTF-16, big endian
FF FE 00 00     | UTF-32, little endian
00 00 FE FF     | UTF-32, big endian
Note
A byte order mark is not a control character that selects the byte order of the text.
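The byte order marks listed above can be checked mechanically. Here is a rough sketch, assuming od is available (it is part of coreutils on a typical WSL distribution); the function name bom_of is hypothetical, and the result is only a guess based on the first four bytes of the file:

```shell
# bom_of is a hypothetical helper: it names the encoding suggested by
# the byte order mark at the start of a file, if there is one.
bom_of() {
    # Dump at most the first four bytes as hex and strip the separators.
    head_hex=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$head_hex" in
        efbbbf*)  echo "UTF-8" ;;
        # UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with FF FE.
        fffe0000) echo "UTF-32, little endian" ;;
        0000feff) echo "UTF-32, big endian" ;;
        fffe*)    echo "UTF-16, little endian" ;;
        feff*)    echo "UTF-16, big endian" ;;
        *)        echo "no BOM found" ;;
    esac
}

# Usage: bom_of josh.txt
```

A file without a BOM can still be UTF-16 or UTF-32, so "no BOM found" does not prove the file is ASCII or UTF-8.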
It turns out the text file is encoded in "UTF-16, little endian". Just for fun, I ran file:
$ file josh.txt
josh.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
$
That's it! I downloaded this file through Webex on the Windows host, and Webex presumably saved it as "UTF-16, little endian", the encoding Windows uses natively for Unicode text.
How do I grep this file? There are probably many ways, but I just use the iconv command to convert the encoding from utf-16 to utf-8 and pipe the output to grep, like this:
$ iconv -f utf-16le -t utf-8 josh.txt | grep "Josh"
Josh
$ echo $?
0
$
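If this comes up often, the pipeline above can be wrapped in a small shell function. This is only a sketch: the name grep16 is made up, it assumes the file really is UTF-16 little endian (as file reported here), and it additionally strips the carriage returns left over from the CRLF line endings:

```shell
# grep16 is a hypothetical wrapper. Usage: grep16 PATTERN FILE
# Assumes FILE is encoded as UTF-16, little endian.
grep16() {
    # Convert to UTF-8, drop the CR of each CRLF, then search.
    # The function's exit status is grep's: 0 on a match, 1 on none.
    iconv -f utf-16le -t utf-8 "$2" | tr -d '\r' | grep -- "$1"
}

# Usage: grep16 "Josh" josh.txt
```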
Problem solved!