Sunday, May 24, 2020

"grep" does not work on WSL?

Does "grep" not always work on the Windows Subsystem for Linux (WSL) on Windows 10? I have been investigating a problem that bothered me a great deal. Here is what I saw:

$ vi josh.txt

What I saw in vi was:

  Josh
  Anonymous
~
~

OK, let's grep something ...

$ grep "Josh" josh.txt
$ echo $?
1

Shouldn't I have seen a match and exit code 0 instead? I didn't have a clue until I ran strace:

$ strace grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\377\376 \0 \0J\0o\0s\0h\0 \0\r\0\n\0 \0 \0A\0n\0o\0n\0"..., 98304) = 42
read(3, "", 98304)                      = 0
close(3)                                = 0
...
$

Good, I saw 'J', 'o', ..., but what are '\377', '\376', ...? Rather than converting the octal escapes to hexadecimal by hand, I let strace print them in hexadecimal for me:

$ strace -x grep "Josh" josh.txt
...
openat(AT_FDCWD, "josh.txt", O_RDONLY|O_NOCTTY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=42, ...}) = 0
read(3, "\xff\xfe\x20\x00\x20\x00\x4a\x00\x6f\x00\x73\x00\x68\x00\x20\x00\x0d\x00\x0a\x00\x20\x00\x20\x00\x41\x00\x6e\x00\x6f\x00\x6e\x00"..., 98304) = 42
read(3, "", 98304)                      = 0
close(3)                                = 0
...
$
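For a one-off conversion, the shell's printf can also do the octal-to-hexadecimal step, because a numeric argument with a leading 0 is read as octal. A minimal sketch; oct_to_hex is a hypothetical helper name and not part of strace:

```shell
# oct_to_hex OCTAL_DIGITS: hypothetical helper that prints the value in hex.
oct_to_hex() {
    # The leading 0 makes printf parse the argument as an octal constant.
    printf '%x\n' "0$1"
}

oct_to_hex 377   # prints ff
oct_to_hex 376   # prints fe
```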

Huh? No characters? What are these "\xff\xfe\x20\x00..."? How about

$ cat josh.txt
    J o s h
     A n o n y m o u s
$

At this point, I realized that the character encoding was neither ASCII nor UTF-8 but something else, and that the leading bytes were a "byte order mark (BOM)". The Windows API documentation has a page with the following table:

Byte order mark   Description
EF BB BF          UTF-8
FF FE             UTF-16, little endian
FE FF             UTF-16, big endian
FF FE 00 00       UTF-32, little endian
00 00 FE FF       UTF-32, big endian

Note: A byte order mark is not a control character that selects the byte order of the text.
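
The table can be turned into a small check on a file's first bytes. This is only a sketch: bom_of is a hypothetical helper name, and it relies on the standard od options -An, -tx1 and -N to dump the first four bytes as hex. The UTF-32 patterns are tested before the UTF-16 ones because "FF FE 00 00" also starts with "FF FE".

```shell
# bom_of FILE: hypothetical helper that prints the encoding implied by the BOM.
bom_of() {
    # Dump up to four leading bytes as hex, normalized to e.g. "ff fe 20 00".
    sig=$(od -An -tx1 -N 4 "$1" | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
    case "$sig" in
        "ff fe 00 00"*) echo "UTF-32, little endian" ;;
        "00 00 fe ff"*) echo "UTF-32, big endian" ;;
        "ef bb bf"*)    echo "UTF-8" ;;
        "ff fe"*)       echo "UTF-16, little endian" ;;
        "fe ff"*)       echo "UTF-16, big endian" ;;
        *)              echo "no BOM found" ;;
    esac
}
```

A file whose first bytes are FF FE 20 00, like the one in the strace output above, falls through to the "UTF-16, little endian" branch.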

The file starts with FF FE followed by 20 00, so it turns out the text file is encoded in "UTF-16, little endian". Just to confirm, I ran file:

$ file josh.txt
josh.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
$

That's it! I got this file by downloading it through Webex on the Windows host, and Webex must have saved it as "UTF-16, little endian", the encoding Windows uses natively for Unicode text.

How do I grep this file? There are probably many other ways, but I just used the iconv command to convert the encoding from UTF-16 to UTF-8 and piped the output to grep:

$ iconv -f utf-16le -t utf-8 josh.txt | grep "Josh"
  Josh
$ echo $?
0
$
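
The same pipeline can be wrapped in a small function. A hedged sketch: grep16 is a hypothetical name, and it assumes an iconv that, when given "-f UTF-16", reads the byte order from the BOM and drops the BOM from its output (as far as I can tell, with "-f utf-16le" the BOM is instead passed through as an invisible character at the start of the first line). It also strips the carriage returns left over from the CRLF line endings, so patterns anchored with "$" can match.

```shell
# grep16 PATTERN FILE: hypothetical wrapper that searches a UTF-16 text file.
grep16() {
    # Convert to UTF-8, remove the CR of each CRLF pair, then search.
    # The function's exit status is grep's: 0 on a match, 1 on no match.
    iconv -f UTF-16 -t UTF-8 "$2" | tr -d '\r' | grep -- "$1"
}
```

With the file above, grep16 "Josh" josh.txt should print the matching line and leave 0 in $?.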

Problem solved!
