What's different between UTF-8 and UTF-8 without BOM?

admin · Posted on 2017-11-13 08:33:50

The UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

The Unicode Standard permits a BOM in UTF-8 files but neither requires nor recommends it.
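To make the difference concrete, here is a minimal Python sketch. It relies on Python's built-in "utf-8" and "utf-8-sig" codecs; the sample string and variable names are only for illustration.

```python
text = "Data 123"

without_bom = text.encode("utf-8")      # plain UTF-8, no prefix
with_bom = text.encode("utf-8-sig")     # same bytes preceded by EF BB BF

# The only difference between the two is the three-byte prefix.
bom_prefix = with_bom[:3]
same_payload = with_bom[3:] == without_bom

# Decoding BOM-prefixed bytes as plain "utf-8" keeps U+FEFF as the first
# character; "utf-8-sig" drops it and also accepts input without a BOM.
kept = with_bom.decode("utf-8")
dropped = with_bom.decode("utf-8-sig")
plain = without_bom.decode("utf-8-sig")
```

When reading files of unknown origin, decoding with "utf-8-sig" is usually the safer choice, since it handles both variants.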

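The byte-order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character is used to indicate the byte order. It is also commonly used as a marker that a file is encoded in UTF-8, UTF-16, or UTF-32.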

Older versions of Windows Notepad do not reliably detect UTF-8 without a BOM, and they always write a BOM when saving as UTF-8. If you want to use Notepad, keep the BOM. If you want to work without a BOM, use a more capable text editor (my choice is Notepad++).

Encoding	Representation (hexadecimal)	Representation (decimal)
UTF-8	EF BB BF	239 187 191
UTF-16 (big-endian)	FE FF	254 255
UTF-16 (little-endian)	FF FE	255 254
UTF-32 (big-endian)	00 00 FE FF	0 0 254 255
UTF-32 (little-endian)	FF FE 00 00	255 254 0 0
UTF-7	2B 2F 76 followed by one of: 38, 39, 2B, 2F	43 47 118 followed by one of: 56, 57, 43, 47
UTF-1	F7 64 4C	247 100 76
UTF-EBCDIC	DD 73 66 73	221 115 102 115
SCSU (Standard Compression Scheme for Unicode)	0E FE FF	14 254 255
BOCU-1	FB EE 28, optionally followed by FF	251 238 40, optionally followed by 255
GB-18030	84 31 95 33	132 49 149 51
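The table can be turned into a simple BOM sniffer. The sketch below is only an illustration: `detect_bom` and `BOMS` are hypothetical names, and it covers only the UTF-8/16/32 rows, using the constants from Python's standard `codecs` module.

```python
import codecs

# Check longer signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]

def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A missing BOM (a `None` result) says nothing about the encoding. Also, a UTF-16 LE file whose first character is U+0000 would be reported as UTF-32 LE, so treat the result as a hint rather than proof.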

If you use .NET, you can omit the BOM by configuring UTF8Encoding accordingly.
Passing false to the UTF8Encoding constructor turns the BOM off, as in the following example (written in X++ using .NET Interop):

System.Text.Encoding encoding = new System.Text.UTF8Encoding(false);
;
System.IO.File::WriteAllText(@'C:\test.txt', "Data 123", encoding);
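For comparison, roughly the same choice can be sketched in Python, where the codec name selects whether a BOM is written. The helpers `write_text` and `first_bytes` and the temporary file names are made up for this illustration.

```python
import os
import tempfile

def write_text(path, text, bom=False):
    # "utf-8" never emits a BOM; "utf-8-sig" writes EF BB BF before the text.
    encoding = "utf-8-sig" if bom else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)

def first_bytes(path, n=3):
    # Read the first n raw bytes so the BOM (if any) is visible.
    with open(path, "rb") as f:
        return f.read(n)

tmp_dir = tempfile.mkdtemp()
no_bom_path = os.path.join(tmp_dir, "no_bom.txt")
bom_path = os.path.join(tmp_dir, "bom.txt")
write_text(no_bom_path, "Data 123")
write_text(bom_path, "Data 123", bom=True)
```

The two files differ only by the three leading bytes, which you can confirm by comparing `first_bytes(no_bom_path)` with `first_bytes(bom_path)`.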


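Linux uses UTF-8 without a BOM. One reason, of course, is that UTF-8 does not need a BOM. More importantly, the Unix design philosophy cannot accommodate text that carries a BOM: everything is a file, and every file is a stream. A stream can be cut at any point and each piece parsed independently without changing its meaning, so it can have neither a header nor a trailer. Since there is no header, a BOM is not allowed; otherwise, if you cut a stream into ten thousand pieces, you would have to put a BOM in front of each of the ten thousand fragments, and modifying the stream content like that goes against the design. OS X is derived from BSD, and BSD follows the same Unix design philosophy, so on both Linux and OS X the BOM cannot exist. With or without Microsoft, BOM-less UTF-8 is the only possible standard there. This follows from the Unix design philosophy, not from any deliberate attempt to be incompatible.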
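The stream argument is easy to reproduce. The following Python sketch joins two BOM-prefixed byte strings the way `cat a.txt b.txt` would; the contents and variable names are invented for the example.

```python
import codecs

part_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Concatenation simply joins the bytes; nothing removes the second BOM.
joined = part_a + part_b

# "utf-8-sig" strips a BOM only at the very start of the input,
# so the BOM of the second part survives as a stray U+FEFF.
decoded = joined.decode("utf-8-sig")
lines = decoded.splitlines()
```

The second line now starts with an invisible U+FEFF, so a comparison like `lines[1] == "second"` fails even though the text looks identical on screen.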

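And Microsoft? The situation is similar. Whatever Unix specifies, Microsoft could only go with UTF-8 with a BOM, because Microsoft's systems default to the user's current code page. The current code page is not UTF-8, so UTF-8, as a format other than the current code page, could not be recognized. The UTF-8 BOM is Microsoft's own innovation, an identifying signature that Microsoft added. Using a byte-order mark as a signature is inelegant, but on Windows it has no side effects. Relying on a clearly defined file-header marker is entirely acceptable for Windows. This was done for compatibility with earlier versions of Microsoft's systems, not to deliberately create incompatibility with Unix-lineage systems.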
In Windows PowerShell, the following command re-encodes a file as UTF-8; the output is UTF-8 with a BOM:
powershell.exe -Command "Get-Content A0401.txt | Set-Content -Encoding utf8 A0401-utf8.txt"
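If you need to control whether the converted file carries a BOM, a small script can do the same job. This is a sketch under assumptions: `reencode` is a hypothetical helper, and cp1252 merely stands in for whatever the source file's code page actually is.

```python
import os
import tempfile

def reencode(src_path, dst_path, src_encoding, bom=False):
    """Re-encode a text file as UTF-8, optionally with a leading BOM."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    dst_encoding = "utf-8-sig" if bom else "utf-8"
    with open(dst_path, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(text)

# Demo with throwaway files; the source encoding is an assumption.
work_dir = tempfile.mkdtemp()
src_file = os.path.join(work_dir, "legacy.txt")
dst_file = os.path.join(work_dir, "legacy-utf8.txt")
with open(src_file, "wb") as f:
    f.write("café\r\n".encode("cp1252"))

reencode(src_file, dst_file, "cp1252", bom=False)
with open(dst_file, "rb") as f:
    result = f.read()
```

Passing `bom=True` instead produces a file with the EF BB BF prefix.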
