yohhoyの日記

技術的メモをしていきたい日記

改行コード(CR/LF)と改行文字と標準C

プログラミング言語C標準規格における改行文字(new-line character)と改行コードCR, LFとの関係性について。

まとめ:

  • C標準規格ではプログラム内部で扱う「改行文字」と、外部ファイルにおける具体的なCR, LF等の「文字コード」を区別する。*1 *2
  • 改行文字をファイル上でどう表現するかについて何ら規定しない。CR/LFを使わない方式も想定されている。
    • UNIX互換システムの場合、改行文字==改行コードLF(0x0A)となる。
    • Windows OSの場合、改行文字は2個の改行コードCRLF(0x0D 0x0A)で表現される。
    • 上記のような改行コードによる行区切り表現だけでなく、メタ情報を利用した行区切り位置表現、長さプレフィックスと文字列データ表現、固定長レコードと特殊パディング文字表現(!)*3など、多種多様なテキストデータの表現方式を許容する。
  • 仮想ターミナルなどの文字表示デバイスへ出力する場合、下記エスケープシーケンスの動作を規定する。
    • \n(new-line):次行の先頭位置にカーソル位置を移動する。
    • \r(carriage return):現在行の先頭位置にカーソル位置を移動する。

文字集合(Character sets)

C99 5.2.1/p1, p3より一部引用(下線部は強調)。

1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). (snip)

3 Both the basic source and basic execution character sets shall have the following members:
 (snip)
In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. (snip)

C99 Rationale, 5.2.1より一部引用。

The C89 Committee ultimately came to remarkable unanimity on the subject of character set requirements. There was strong sentiment that C should not be tied to ASCII, despite its heritage and despite the precedent of Ada being defined in terms of ASCII. Rather, an implementation is required to provide a unique character code for each of the printable graphics used by C, and for each of the control codes representable by an escape sequence. (No particular graphic representation for any character is prescribed; thus the common Japanese practice of using the glyph "¥" for the C character "\" is perfectly legitimate.) Translation and execution environments may have different character sets, but each must meet this requirement in its own way. The goal is to ensure that a conforming implementation can translate a C translator written in C.

For this reason, and for economy of description, source code is described as if it undergoes the same translation as text that is input by the standard library I/O routines: each line is terminated by some newline character regardless of its external representation.

文字表示セマンティクス(Character display semantics)

C99 5.2.2/p2-3より一部引用(下線部は強調)。

2 Alphabetic escape sequences representing nongraphic characters in the execution character set are intended to produce actions on display devices as follows:

  • (snip)
  • \n (new line) Moves the active position to the initial position of the next line.
  • \r (carriage return) Moves the active position to the initial position of the current line.
  • (snip)

3 Each of these escape sequences shall produce a unique implementation-defined value which can be stored in a single char object. The external representations in a text file need not be identical to the internal representations, and are outside the scope of this International Standard.

C99 Rationale, 5.2.2より一部引用(下線部は強調)。

The Standard defines a number of internal character codes for specifying "format-effecting actions on display devices," and provides printable escape sequences for each of them. These character codes are clearly modeled after ASCII control codes, and the mnemonic letters used to specify their escape sequences reflect this heritage. Nevertheless, they are internal codes for specifying the format of a display in an environment-independent manner; they must be written to a text file to effect formatting on a display device. The Standard states quite clearly that the external representation of a text file (or data stream) may well differ from the internal form, both in character codes and number of characters needed to represent a single internal code.

The distinction between internal and external codes most needs emphasis with respect to new-line. INCITS L2, Codes and Character Sets (and now also ISO/IEC JTC 1/SC2/WG1, 8 Bit Character Sets), uses the term to refer to an external code used for information interchange whose display semantics specify a move to the next line. Although ISO/IEC 646 deprecates the combination of the motion to the next line with a motion to the initial position on the line, the C Standard uses new-line to designate the end-of-line internal code represented by the escape sequence '\n'. While this ambiguity is perhaps unfortunate, use of the term in the latter sense is nearly universal within the C community. But the knowledge that this internal code has numerous external representations depending upon operating system and medium is equally widespread.

ストリーム(Streams)

C99 7.19.2/p1-3より一部引用(下線部は強調)。

1 Input and output, whether to or from physical devices such as terminals and tape drives, or whether to or from files supported on structured storage devices, are mapped into logical data streams, whose properties are more uniform than their various inputs and outputs. Two forms of mapping are supported, for text streams and for binary streams.232)
脚注232) An implementation need not distinguish between text streams and binary streams. In such an implementation, there need be no new-line characters in a text stream nor any limit to the length of a line.

2 A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. Whether space characters that are written out immediately before a new-line character appear when read in is implementation-defined.

C99 Rationale, 5.2.1, 7.19.2より一部引用。

C inherited its notion of text streams from the UNIX environment in which it was born. Having each line delimited by a single newline character, regardless of the characteristics of the actual terminal, supported a simple model of text as a sort of arbitrary length scroll or "galley." Having a channel that is "transparent" (no file structure or reserved data encodings) eliminated the need for a distinction between text and binary streams.

(snip)

Troublesome aspects of the stream concept include:

The definition of lines. In the UNIX model, division of a file into lines is effected by newline characters. Different techniques are used by other systems: lines may be separated by CR-LF (carriage return, line feed) or by unrecorded areas on the recording medium; or each line may be prefixed by its length. The Standard addresses this diversity by specifying that newline be used as a line separator at the program level, but then permitting an implementation to transform the data read or written to conform to the conventions of the environment.
Some environments represent text lines as blank-filled fixed-length records. Thus the Standard specifies that it is implementation-defined whether trailing blanks are removed from a line on input. (This specification also addresses the problems of environments which represent text as variable-length records, but do not allow a record length of 0: an empty line may be written as a one-character record containing a blank, and the blank is stripped on input.)

Transparency. Some programs require access to external data without modification. For instance, transformation of CR-LF to a newline character is usually not desirable when object code is processed. The Standard defines two stream types, text and binary, to allow a program to define, when a file is opened, whether the preservation of its exact contents or of its line structure is more important in an environment which cannot accurately reflect both.

関連URL

*1:プログラムへのデータ入出力をテキストストリーム(text stream)で扱う場合のお話。バイナリストリーム(binary stream)で扱う場合は外部ファイル上の文字コード値に直接アクセスする。

*2:標準入出力 stdin, strerr や標準エラー出力 stderr はいずれもテキストストリームとして初期化される。(C99 7.19.3/p7)

*3:テキストストリームをファイルに書き出す場合、処理系によっては一定長に切詰め(truncated)られる可能性が明記されている。(C99 7.19.3/p2)