Dark Side cat /dev/brain

UTF-8 & BOM

What’s BOM(Byte-Order Mark)

引自Wikipedia中文版):

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

[Wikipedia](http://en.wikipedia.org/wiki/Byte_order_mark)

常用编码中,有 BOM 标识的如下表:

Encoding Representation Bytes as CP1252 characters Notes
UTF-8 EF BB BF  UTF-8 实际上不存在字节序问题,因此通常不使用 BOM
UTF-16 (BE) FE FF þÿ  
UTF-16 (LE) FF FE ÿþ  
UTF-32 (BE) 00 00 FE FF <00><00>þÿ <00> refers to the ASCII null character
UTF-32 (LE) FF FE 00 00 ÿþ<00><00> <00> refers to the ASCII null character
GB18030 84 31 95 33 „1•3 GB18030 实际上不存在字节序问题,因此通常不使用 BOM

其中:

  • UTF-16 和 UTF-32 的BOM 标识暂时没有发现无法正常处理的应用程序
  • GB18030 的 BOM 标识非常罕见,需要注意的是 iconv 和 libiconv 处理行为与 UTF-8 BOM 类似,请参考“ICONV PROBLEM

Who Uses BOM

  • UTF-16 和 UTF-32 的 BOM 标识是为了解决字节序问题而存在的。
  • UTF-8 的 BOM 标识不会对编码产生任何实际影响。按照 unicode.org 的解释,是为了区分 UTF-8 编码的数据流和 ASCII 编码的数据流而存在的。部分应用程序确实遵循这个设计。

什么情况下会有BOM

这里只讨论 UTF-8 BOM

有部分应用程序会在保存 UTF-8 文件的时候在文件头添加 BOM 标识。已知的应用程序如下:

默认添加:

以下程序在将文件保存为 UTF-8 文本时会默认添加 BOM 标识。

  • Microsoft VisualStudio
  • Microsoft Notepad(俗称“记事本”)
  • Eclipse
  • Notepad++
  • UltraEdit

特殊情况:

以下程序在以特殊方式保存文件时会为 UTF-8 文本添加 BOM 标识。

  • LibreOffice(“Save As Encoded Text” 选择 UTF-8 默认添加 BOM)
  • Sublime Text ( Save With Encoding 有一个 “UTF-8 with BOM” 选项)

有什么问题

  • 当 BOM 出现在 UTF-8 文件头部的时候是没有问题的,几乎所有常用软件都能正确识别它们(iconv 是个例外)

  • gcc < 4.4.0 在尝试编译带 BOM 标识的源代码时会报错

  • iconv 以及使用 libiconv 的许多软件会死板的 对 BOM 也做转换,如果找不到对应的字符,则转换失败并会给出一个报错。详细说明参考“ICONV PROBLEM

  • BOM 出现在文件内容中的时候

    • Unicode 3.2 之前,被认为是“零宽字符(ZERO-WIDTH NON-BREAKING SPACE)”
    • 自 Unicode 3.2 起这种用法已经被废弃掉,使用 U+2060 WORD JOINER 代替
    • 常用软件基本上仍然将处于文件内容中的 BOM 标识作为零宽字符处理

In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.

[Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM](http://www.unicode.org/faq/utf_bom.html#BOM)

ICONV PROBLEM

尝试执行如下语句:

echo -e "\xef\xbb\xbf测试\n" | iconv -f UTF-8 -t GBK  ## GBK 没有 BOM

将会出现:

iconv: (stdin):1:0: cannot convert

而尝试执行

echo -e "\xef\xbb\xbf测试\n" | iconv -f UTF-8 -t GB18030 ## GB18030 有 BOM

就会得到正确的结果:

0000000      3184    3395    d0d6    c4ce    e2b2    d4ca    000a
0000015

用于清理 UTF-8 BOM 的 脚本

Python 版本1

逐字节读取传入的字符串,根据UTF-8编码规范进行比对,发现0xEFBBBF就清除掉


def utf_len(ch):
  """
  Get the length of a UTF-8 character in bytes.
  """
  # if ch is 0xxx xxxx, it has 7 bits ( or 1 byte )
  if ((ch & 0x80) == 0):
    return 1
  # if ch is 110x xxxx, it has 11 bits ( or 2 bytes )
  if ((ch & 0xE0) == 0xC0):
    return 2
  # if ch is 1110 xxxx, it has 16 bits ( or 3 bytes )
  if ((ch & 0xF0) == 0xE0):
    return 3
  # if ch is 1111 0xxx, it has 21 bits ( or 4 bytes )
  if ((ch & 0xF8) == 0xF0):
    return 4
  # if ch is 1111 10xx, it has 26 bits ( or 5 bytes )
  if ((ch & 0xFC) == 0xF8):
    return 5
  # if ch is 1111 110x, it has 31 bits ( or 6 bytes )
  if ((ch & 0xFE) == 0xFC):
    return 6
  # Invalid UTF-8 encoding.
  return 1

def clear_bom(string):
  """
  Function to delete UTF-8 BOM character in "string"
  """
  from binascii import hexlify
  ch = ''
  result = ''
  pointer = 0
  while True:
    # EOF reached, end loop
    if pointer >= len(string):
      break

    # read one byte to determine the length of this character
    ch = int(hexlify(string[pointer]), 16)
    # offset equals (length + 1)
    length = utf_len(ch)
    offset = length + 1
    if length == 3:
      # if this character is '\xef\xbb\xbf', ignore it
      if string[pointer:pointer+offset] == '\xef\xbb\xbf':
        pass
    else:
      # read all the character, append it to the result string
      result = result + string[pointer:pointer+offset]
    # pointer points to the next
    pointer = pointer + offset

  return(result)

Python 版本2

除了代码容易看懂一点,跟上面的普通版本没什么区别,甚至效率可能更低。

def clear_bom_unicode(string):
  """
  Function to delete UTF-8 BOM character in "string", by decoding it to unicode string.
  """

  # decode the string to unicode
  string = string.decode('utf8')
  result = ''
  # read one by one character
  for ch in string:
    # if it is u'\uefbbbf', ignore it, else append
    if string <> '\uefbbbf':
      result = result+ch
  return(result.encode('utf8'))

Bash

这段代码来自The Grey Blog,国内用户想看原文请自备梯子。

set -o nounset
set -o errexit


DELETE_ORIG=true
DELETE_FLAG=""
RECURSIVE=false
PROCESSALLFILE=false
PROCESSING_FILES=false
PROCESSALLFILE_FLAG=""
SED_EXEC=sed
USE_EXT=false
FILE_EXT=""
TMP_CMD="mktemp"
TMP_OPTS="--tmpdir="
XDEV=""
ISDARWIN=false


if [ $(uname) == "SunOS" ] ; then
  if [ -x /usr/gnu/bin/sed ] ; then
    echo "Using GNU sed..."
    SED_EXEC=/usr/gnu/bin/sed
  fi
  TMP_OPTS="-p "
fi


if [ $(uname) == "Darwin" ] ; then
  TMP_OPTS="-t tmp"

  SED_EXEC="perl -pe"
  echo "Using perl..."
  ISDARWIN=true

fi


function usage() {
  echo "bom-remove [-adrx] [-s sed-name] [-e ext] files..."
  echo ""
  echo "  -a    Remove the BOM throughout the entire file."
  echo "  -e    Look only for files with the chosen extensions."
  echo "  -d    Do not overwrite original files and do not remove temp files."
  echo "  -r    Scan subdirectories."
  echo "  -s    Specify an alternate sed implementation."
  echo "  -x    Don't descend directories in other filesystems."
}


function checkExecutable() {
  if ( ! which "$1" > /dev/null 2>&1 ); then
    echo "Cannot find executable:" $1
    exit 4
  fi
}


function parseArgs() {
  while getopts "adfrs:e:x" flag
  do
    case $flag in
      a) PROCESSALLFILE=true ; PROCESSALLFILE_FLAG="-a" ;;
      r) RECURSIVE=true ;;
      f) PROCESSING_FILES=true ;;
      s) SED_EXEC=$OPTARG ;;
      e) USE_EXT=true ; FILE_EXT=$OPTARG ;;
      d) DELETE_ORIG=false ; DELETE_FLAG="-d" ;;
      x) XDEV="-xdev" ;;
      *) echo "Unknown parameter." ; usage ; exit 2 ;;
    esac
  done


  shift $(($OPTIND - 1))



  if [ $# == 0 ] ; then
    usage;
    exit 2;
  fi



  # fixing darwin
  if [[ $ISDARWIN == true && $PROCESSALLFILE == false ]] ; then
    PROCESSALLFILE=true
    echo "Process all file is implicitly set on Darwin."
  fi

  FILES=("$@")


  if [ ! -n "$FILES" ]; then
    echo "No files specified. Exiting."
  fi


  if [ $RECURSIVE == true ]  && [ $PROCESSING_FILES == true ] ; then
    echo "Cannot use -r and -f at the same time."
    usage
    exit 1
  fi


  checkExecutable $SED_EXEC
  checkExecutable $TMP_CMD
}


function processFile() {
  if [ $(uname) == "Darwin" ] ; then
    TEMPFILENAME=$($TMP_CMD $TMP_OPTS)
  else
    TEMPFILENAME=$($TMP_CMD $TMP_OPTS"$(dirname "$1")")
  fi
  echo "Processing $1 using temp file $TEMPFILENAME"


  if [ $PROCESSALLFILE == false ] ; then
    cat "$1" | $SED_EXEC '1 s/\xEF\xBB\xBF//' > "$TEMPFILENAME"
  else
    cat "$1" | $SED_EXEC 's/\xEF\xBB\xBF//g' > "$TEMPFILENAME"
  fi


  if [ $DELETE_ORIG == true ] ; then
    if [ ! -w "$1" ] ; then
      echo "$1 is not writable. Leaving tempfile."
    else
      echo "Removing temp file..."
      mv "$TEMPFILENAME" "$1"
    fi
  fi
}


function doJob() {
  # Check if the script has been called from the outside.
  if [ $PROCESSING_FILES == true ] ; then
    for i in $(seq 1 ${#FILES[@]})
    do
      echo ${FILES[$i-1]}
      processFile "${FILES[$i-1]}"
    done


  else
    # processing every file
for i in $(seq 1 ${#FILES[@]})
do
CURRFILE=${FILES[$i-1]}
      # checking if file or directory exist
      if [ ! -e "$CURRFILE" ] ; then echo "File not found: $CURRFILE. Skipping..." ; continue ; fi

      # if a paremeter is a directory, process it recursively if RECURSIVE is set
      if [ -d "$CURRFILE" ] ; then
        if [ $RECURSIVE == true ] ; then
          if [ $USE_EXT == true ] ; then
            find "$CURRFILE" $XDEV -type f -name "*.$FILE_EXT" -exec "$0" $DELETE_FLAG $PROCESSALLFILE_FLAG -f "{}" \;
          else
            find "$CURRFILE" $XDEV -type f -exec "$0" $DELETE_FLAG $PROCESSALLFILE_FLAG -f "{}" \;
          fi
        else
          echo "$CURRFILE is a directory. Skipping..."
        fi
      else
        processFile "$CURRFILE"
      fi
    done
  fi
}


parseArgs "$@"
doJob

Perl

代码来自rishida’s blog

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit;
    }

my @file;   # file content
my $lineno = 0;

my $filename = @ARGV[0];
if ($filename) {
    open( BOMFILE, $filename ) || die "Could not open source file for reading.";
    while (<BOMFILE>) {
        if ($lineno++ == 0) {
            if ( index( $_, '' ) == 0 ) {
                s/^\xEF\xBB\xBF//;
                print "BOM found and removed.\n";
                }
            else { print "No BOM found.\n"; }
            }
        push @file, $_ ;
        }
    close (BOMFILE)  || die "Can't close source file after reading.";

    open (NOBOMFILE, ">$filename") || die "Could not open source file for writing.";
    foreach $line (@file) {
        print NOBOMFILE $line;
        }
    close (NOBOMFILE)  || die "Can't close source file after writing.";
    }
else {  # STDIN -> STDOUT
    while (<>) {
    if (!$lineno++) {
        s/^\xEF\xBB\xBF//;
        }
    push @file, $_ ;
    }

    foreach $line (@file) {
        print $line;
        }
    }

Reference

  1. Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
  2. iconvの「UTF-8」はBOMが無いものとみなす
  3. Forms of Unicode
  4. UTF-8 and Unicode
  5. RFC3629: UTF-8, a transformation format of ISO 10646
  6. Wikipedia: Byte order mark
  7. 文字コードに関する覚え書きと実験
  8. JavaでUTF-8のBOMに対処する
  9. The byte-order mark (BOM) in HTML
comments powered by Disqus