UTF-8 & BOM
11 Aug 2014What’s BOM(Byte-Order Mark)
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
常用编码中,有 BOM 标识的如下表:
Encoding | Representation | Bytes as CP1252 characters | Notes |
---|---|---|---|
UTF-8 | EF BB BF |  | UTF-8 实际上不存在字节序问题,因此通常不使用 BOM |
UTF-16 (BE) | FE FF | þÿ | |
UTF-16 (LE) | FF FE | ÿþ | |
UTF-32 (BE) | 00 00 FE FF | <00><00>þÿ | <00> refers to the ASCII null character |
UTF-32 (LE) | FF FE 00 00 | ÿþ<00><00> | <00> refers to the ASCII null character |
GB18030 | 84 31 95 33 | „1•3 | GB18030 实际上不存在字节序问题,因此通常不使用 BOM |
其中:
- UTF-16 和 UTF-32 的BOM 标识暂时没有发现无法正常处理的应用程序
- GB18030 的 BOM 标识非常罕见,需要注意的是 iconv 和 libiconv 处理行为与 UTF-8 BOM 类似,请参考“ICONV PROBLEM”
Who Uses BOM
- UTF-16 和 UTF-32 的 BOM 标识是为了解决字节序问题而存在的。
- UTF-8 的 BOM 标识不会对编码产生任何实际影响。按照 unicode.org 的解释,是为了区分 UTF-8 编码的数据流和 ASCII 编码的数据流而存在的。部分应用程序确实遵循这个设计。
什么情况下会有BOM
这里只讨论 UTF-8 BOM
有部分应用程序会在保存 UTF-8 文件的时候在文件头添加 BOM 标识。已知的应用程序如下:
默认添加:
以下程序在将文件保存为 UTF-8 文本时会默认添加 BOM 标识。
- Microsoft VisualStudio
- Microsoft Notepad(俗称“记事本”)
- Eclipse
- Notepad++
- UltraEdit
特殊情况:
以下程序在以特殊方式保存文件时会为 UTF-8 文本添加 BOM 标识。
- LibreOffice(“Save As Encoded Text” 选择 UTF-8 默认添加 BOM)
- Sublime Text ( Save With Encoding 有一个 “UTF-8 with BOM” 选项)
有什么问题
-
当 BOM 出现在 UTF-8 文件头部的时候是没有问题的,几乎所有常用软件都能正确识别它们(iconv 是个例外)
-
gcc < 4.4.0 在尝试编译带 BOM 标识的源代码时会报错
-
iconv 以及使用 libiconv 的许多软件会死板的 对 BOM 也做转换,如果找不到对应的字符,则转换失败并会给出一个报错。详细说明参考“ICONV PROBLEM”
-
BOM 出现在文件内容中的时候
- Unicode 3.2 之前,被认为是“零宽字符(ZERO-WIDTH NON-BREAKING SPACE)”
- 自 Unicode 3.2 起这种用法已经被废弃掉,使用 U+2060 WORD JOINER 代替
- 常用软件基本上仍然将处于文件内容中的 BOM 标识作为零宽字符处理
In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.
ICONV PROBLEM
尝试执行如下语句:
echo -e "\xef\xbb\xbf测试\n" | iconv -f UTF-8 -t GBK ## GBK 没有 BOM
将会出现:
iconv: (stdin):1:0: cannot convert
而尝试执行
echo -e "\xef\xbb\xbf测试\n" | iconv -f UTF-8 -t GB18030 ## GB18030 有 BOM
就会得到正确的结果:
0000000 3184 3395 d0d6 c4ce e2b2 d4ca 000a
0000015
用于清理 UTF-8 BOM 的 脚本
Python 版本1
逐字节读取传入的字符串,根据UTF-8编码规范进行比对,发现0xEFBBBF就清除掉
def utf_len(ch):
"""
Get the length of a UTF-8 character in bytes.
"""
# if ch is 0xxx xxxx, it has 7 bits ( or 1 byte )
if ((ch & 0x80) == 0):
return 1
# if ch is 110x xxxx, it has 11 bits ( or 2 bytes )
if ((ch & 0xE0) == 0xC0):
return 2
# if ch is 1110 xxxx, it has 16 bits ( or 3 bytes )
if ((ch & 0xF0) == 0xE0):
return 3
# if ch is 1111 0xxx, it has 21 bits ( or 4 bytes )
if ((ch & 0xF8) == 0xF0):
return 4
# if ch is 1111 10xx, it has 26 bits ( or 5 bytes )
if ((ch & 0xFC) == 0xF8):
return 5
# if ch is 1111 110x, it has 31 bits ( or 6 bytes )
if ((ch & 0xFE) == 0xFC):
return 6
# Invalid UTF-8 encoding.
return 1
def clear_bom(string):
"""
Function to delete UTF-8 BOM character in "string"
"""
from binascii import hexlify
ch = ''
result = ''
pointer = 0
while True:
# EOF reached, end loop
if pointer >= len(string):
break
# read one byte to determine the length of this character
ch = int(hexlify(string[pointer]), 16)
# offset equals (length + 1)
length = utf_len(ch)
offset = length + 1
if length == 3:
# if this character is '\xef\xbb\xbf', ignore it
if string[pointer:pointer+offset] == '\xef\xbb\xbf':
pass
else:
# read all the character, append it to the result string
result = result + string[pointer:pointer+offset]
# pointer points to the next
pointer = pointer + offset
return(result)
Python 版本2
除了代码容易看懂一点,跟上面的普通版本没什么区别,甚至效率可能更低。
def clear_bom_unicode(string):
"""
Function to delete UTF-8 BOM character in "string", by decoding it to unicode string.
"""
# decode the string to unicode
string = string.decode('utf8')
result = ''
# read one by one character
for ch in string:
# if it is u'\uefbbbf', ignore it, else append
if string <> '\uefbbbf':
result = result+ch
return(result.encode('utf8'))
Bash
这段代码来自The Grey Blog,国内用户想看原文请自备梯子。
set -o nounset
set -o errexit
DELETE_ORIG=true
DELETE_FLAG=""
RECURSIVE=false
PROCESSALLFILE=false
PROCESSING_FILES=false
PROCESSALLFILE_FLAG=""
SED_EXEC=sed
USE_EXT=false
FILE_EXT=""
TMP_CMD="mktemp"
TMP_OPTS="--tmpdir="
XDEV=""
ISDARWIN=false
if [ $(uname) == "SunOS" ] ; then
if [ -x /usr/gnu/bin/sed ] ; then
echo "Using GNU sed..."
SED_EXEC=/usr/gnu/bin/sed
fi
TMP_OPTS="-p "
fi
if [ $(uname) == "Darwin" ] ; then
TMP_OPTS="-t tmp"
SED_EXEC="perl -pe"
echo "Using perl..."
ISDARWIN=true
fi
function usage() {
echo "bom-remove [-adrx] [-s sed-name] [-e ext] files..."
echo ""
echo " -a Remove the BOM throughout the entire file."
echo " -e Look only for files with the chosen extensions."
echo " -d Do not overwrite original files and do not remove temp files."
echo " -r Scan subdirectories."
echo " -s Specify an alternate sed implementation."
echo " -x Don't descend directories in other filesystems."
}
function checkExecutable() {
if ( ! which "$1" > /dev/null 2>&1 ); then
echo "Cannot find executable:" $1
exit 4
fi
}
function parseArgs() {
while getopts "adfrs:e:x" flag
do
case $flag in
a) PROCESSALLFILE=true ; PROCESSALLFILE_FLAG="-a" ;;
r) RECURSIVE=true ;;
f) PROCESSING_FILES=true ;;
s) SED_EXEC=$OPTARG ;;
e) USE_EXT=true ; FILE_EXT=$OPTARG ;;
d) DELETE_ORIG=false ; DELETE_FLAG="-d" ;;
x) XDEV="-xdev" ;;
*) echo "Unknown parameter." ; usage ; exit 2 ;;
esac
done
shift $(($OPTIND - 1))
if [ $# == 0 ] ; then
usage;
exit 2;
fi
# fixing darwin
if [[ $ISDARWIN == true && $PROCESSALLFILE == false ]] ; then
PROCESSALLFILE=true
echo "Process all file is implicitly set on Darwin."
fi
FILES=("$@")
if [ ! -n "$FILES" ]; then
echo "No files specified. Exiting."
fi
if [ $RECURSIVE == true ] && [ $PROCESSING_FILES == true ] ; then
echo "Cannot use -r and -f at the same time."
usage
exit 1
fi
checkExecutable $SED_EXEC
checkExecutable $TMP_CMD
}
function processFile() {
if [ $(uname) == "Darwin" ] ; then
TEMPFILENAME=$($TMP_CMD $TMP_OPTS)
else
TEMPFILENAME=$($TMP_CMD $TMP_OPTS"$(dirname "$1")")
fi
echo "Processing $1 using temp file $TEMPFILENAME"
if [ $PROCESSALLFILE == false ] ; then
cat "$1" | $SED_EXEC '1 s/\xEF\xBB\xBF//' > "$TEMPFILENAME"
else
cat "$1" | $SED_EXEC 's/\xEF\xBB\xBF//g' > "$TEMPFILENAME"
fi
if [ $DELETE_ORIG == true ] ; then
if [ ! -w "$1" ] ; then
echo "$1 is not writable. Leaving tempfile."
else
echo "Removing temp file..."
mv "$TEMPFILENAME" "$1"
fi
fi
}
function doJob() {
# Check if the script has been called from the outside.
if [ $PROCESSING_FILES == true ] ; then
for i in $(seq 1 ${#FILES[@]})
do
echo ${FILES[$i-1]}
processFile "${FILES[$i-1]}"
done
else
# processing every file
for i in $(seq 1 ${#FILES[@]})
do
CURRFILE=${FILES[$i-1]}
# checking if file or directory exist
if [ ! -e "$CURRFILE" ] ; then echo "File not found: $CURRFILE. Skipping..." ; continue ; fi
# if a paremeter is a directory, process it recursively if RECURSIVE is set
if [ -d "$CURRFILE" ] ; then
if [ $RECURSIVE == true ] ; then
if [ $USE_EXT == true ] ; then
find "$CURRFILE" $XDEV -type f -name "*.$FILE_EXT" -exec "$0" $DELETE_FLAG $PROCESSALLFILE_FLAG -f "{}" \;
else
find "$CURRFILE" $XDEV -type f -exec "$0" $DELETE_FLAG $PROCESSALLFILE_FLAG -f "{}" \;
fi
else
echo "$CURRFILE is a directory. Skipping..."
fi
else
processFile "$CURRFILE"
fi
done
fi
}
parseArgs "$@"
doJob
Perl
代码来自rishida’s blog
if ($#ARGV > 0) {
print STDERR "Too many arguments!\n";
exit;
}
my @file; # file content
my $lineno = 0;
my $filename = @ARGV[0];
if ($filename) {
open( BOMFILE, $filename ) || die "Could not open source file for reading.";
while (<BOMFILE>) {
if ($lineno++ == 0) {
if ( index( $_, '' ) == 0 ) {
s/^\xEF\xBB\xBF//;
print "BOM found and removed.\n";
}
else { print "No BOM found.\n"; }
}
push @file, $_ ;
}
close (BOMFILE) || die "Can't close source file after reading.";
open (NOBOMFILE, ">$filename") || die "Could not open source file for writing.";
foreach $line (@file) {
print NOBOMFILE $line;
}
close (NOBOMFILE) || die "Can't close source file after writing.";
}
else { # STDIN -> STDOUT
while (<>) {
if (!$lineno++) {
s/^\xEF\xBB\xBF//;
}
push @file, $_ ;
}
foreach $line (@file) {
print $line;
}
}