Does "grep" not always work on the Windows Subsystem for Linux (WSL) on Windows 10? I have been investigating a problem that bothered me a great deal. Here is what I saw,
$ vi josh.txt
What I saw in vi is,
Josh
Anonymous
~
~ ~
OK, let's grep something ...
$ grep "Josh" josh.txt
$ echo $?
1
Should I have seen a match and exit-code 0 instead?
I had no clue until I ran strace,
$ strace grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\377\376 \0 \0J\0o\0s\0h\0 \0\r\0\n\0 \0 \0A\0n\0o\0n\0"..., 98304) = 42
read(3, "", 98304) = 0
close(3) = 0
...
$
Good, I saw 'J', 'o', ..., but what are these '\377', '\376', ...? Instead of converting the octal numbers to hexadecimal by hand, I let strace do this for me with its -x option,
$ strace -x grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\xff\xfe\x20\x00\x20\x00\x4a\x00\x6f\x00\x73\x00\x68\x00\x20\x00\x0d\x00\x0a\x00\x20\x00\x20\x00\x41\x00\x6e\x00\x6f\x00\x6e\x00"..., 98304) = 42
read(3, "", 98304) = 0
close(3) = 0
...
$
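As an aside, the octal-to-hexadecimal conversion can also be done by hand in the shell. This is a minimal sketch that relies on printf reading a numeric argument with a leading 0 as octal; the variable names are only illustrative.

```shell
# printf reads a numeric argument with a leading 0 as an octal constant,
# so %x prints the same value in hexadecimal.
first=$(printf '%x' 0377)
second=$(printf '%x' 0376)
echo "$first $second"   # prints "ff fe"
```

So '\377' and '\376' are the same two bytes as '\xff' and '\xfe'.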
Huh? No characters? What are these "\xff\xfe\x20\x00..."? How about cat?
$ cat josh.txt
J o s h
A n o n y m o u s
$
At this moment, I realized that the character encoding is neither ASCII nor UTF-8. It must be something else, and the leading bytes are a "Byte Order Mark (BOM)".
The Windows API documentation has a page with the following table,
| Byte order mark | Description |
| EF BB BF | UTF-8 |
| FF FE | UTF-16, little endian |
| FE FF | UTF-16, big endian |
| FF FE 00 00 | UTF-32, little endian |
| 00 00 FE FF | UTF-32, big endian |
Note: A byte order mark is not a control character that selects the byte order of the text.
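The table can be turned into a quick check on the first bytes of a file. The following is only a sketch: detect_bom is a made-up helper name, and it assumes head -c and od are available, as they are on typical Linux systems. A file without a BOM, or a UTF-16 little-endian file whose first character is NUL, would be misjudged by a check this simple.

```shell
# detect_bom is a hypothetical helper name, not a standard command.
detect_bom() {
    # First four bytes as space-separated hex, e.g. "ff fe 4a 00".
    # The unquoted $(...) inside echo collapses od's padding to single spaces.
    sig=$(echo $(head -c 4 "$1" | od -An -tx1))
    case "$sig" in
        "ef bb bf"*)   echo "UTF-8" ;;
        "ff fe 00 00") echo "UTF-32, little endian" ;;  # must precede the FF FE check
        "00 00 fe ff") echo "UTF-32, big endian" ;;
        "ff fe"*)      echo "UTF-16, little endian" ;;
        "fe ff"*)      echo "UTF-16, big endian" ;;
        *)             echo "no BOM found" ;;
    esac
}

# Demo on a hand-built sample: FF FE followed by "J" in UTF-16, little endian.
sample=$(mktemp)
printf '\377\376J\000' > "$sample"
result=$(detect_bom "$sample")
echo "$result"   # prints "UTF-16, little endian"
rm -f "$sample"
```

The UTF-32 little-endian pattern is tested before the UTF-16 one because FF FE 00 00 also begins with FF FE.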
It turns out the text file is encoded in "UTF-16, little endian". Just for fun, I ran file,
$ file josh.txt
josh.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
$
That's it! I got this file by downloading it from Webex on the Windows host, and Webex presumably saved it as "UTF-16, little endian", the encoding Windows commonly uses for Unicode text.
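The failure can be reproduced without Webex. This is a minimal sketch using a hand-built sample file (the temporary file and variable names are illustrative, not the original josh.txt): every character takes two bytes, so the adjacent bytes "Josh" that grep looks for never appear.

```shell
# Hypothetical reproduction with a hand-built file; the bytes mirror the strace output.
sample=$(mktemp)
# BOM (FF FE), then "Josh" and a newline, two bytes per character, low byte first.
printf '\377\376J\000o\000s\000h\000\n\000' > "$sample"

# grep searches for the adjacent bytes J o s h, but a NUL byte separates each letter.
status=0
grep "Josh" "$sample" > /dev/null || status=$?
echo "exit code: $status"   # prints "exit code: 1"
rm -f "$sample"
```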
How do I grep this file? There may be many other methods, but I just used the iconv command to convert the encoding from UTF-16 to UTF-8 and then piped the output to grep, like,
$ iconv -f utf-16le -t utf-8 josh.txt | grep "Josh"
Josh
$ echo $?
0
$
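If this comes up often, the pipeline can be wrapped in a small function. This is a sketch only: grep_utf16 is a made-up name, and it assumes an iconv that, when given plain "UTF-16" as the source encoding, reads the BOM to pick the byte order and leaves it out of the output (as far as I know, glibc's iconv behaves this way, whereas with "utf-16le" the BOM is converted along with the text).

```shell
# grep_utf16 is a hypothetical helper name; usage: grep_utf16 PATTERN FILE
grep_utf16() {
    # Convert to UTF-8, strip the carriage returns left over from CRLF line
    # endings, then search. The function returns grep's exit status.
    iconv -f UTF-16 -t UTF-8 "$2" | tr -d '\r' | grep -- "$1"
}
```

Calling grep_utf16 "Josh" josh.txt should then behave like the pipeline above, and $? carries grep's exit code.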
Problem solved!