From: Greg Bailey [mailto:greg@minerva.com]
Sent: Tuesday, June 01, 1999 7:41 PM
To: 'ANSForth real mailgroup'
Cc: 'Localisation and Internationalisation'; 'ark-gvb-i'
Subject: Octet String Prospectus

Problem Statement:
------------------

Most standards defining interoperable data structures, such as those used in networking and cryptography, do so in terms of sequences of octets. Even in embedded applications, these standards are increasingly relevant, and indeed supporting them is often a critical application requirement.

The most commonly encountered computer architectures today address their memories in units of 8 bit bytes, and Standard Forth applications have no difficulty in manipulating octet sequences directly when running on typical systems, with eight bit character sets, for such machines.

However, such applications are environmentally dependent upon this common combination in which addresses are in units of bytes or octets, *and* in which characters are eight bits wide; or upon machines whose addresses are in units such as 4-bit nibbles which divide 8, and whose characters are also eight bits wide. On these families of architectures portable software may manipulate octet sequences by treating them as characters. If, however, either character size or address units are larger than eight bits, we do not document standard ways of allocating, manipulating, or performing I/O using sequences of octets.

This proposal provides a mechanism that may be used by standard programs to manipulate sequences of octets on any standard system which supports it.

(Actual packaging TBD. Should probably be an extension, but if so it will depend upon presence of the DOUBLE extension; and it will include additions to the FILE extension if both are present.)

Discussion of common practice and architectural tradeoffs:
----------------------------------------------------------

Many systems and applications have been written for "cell addressed" machines with 16 bit and larger address units. Many strategies have been used for addressing characters, which were generally equivalent to octets, on such machines. In general the hardware does not directly support linear addressing of bytes, characters, or octets, so this type of arithmetically usable address has generally been simulated in software.

The most commonly used strategy has been to multiply the physical, cell address by the number of octets held within a cell, and add to this product the relative position of the octet within the cell, in order to form a linear octet address. Coding strategies for employing this additional, synthetic address data type depend on the nature of the underlying CPU.

Since there is usually a substantial performance penalty for using these synthetic addresses, it has been common practice to use the octet address data type only in conjunction with octet operators, and to use native cell addresses for all other purposes.

Since the dynamic range required of this synthetic data type is one or more bits larger than for native address units, it follows that if the machine supports full cell width cell addresses, then an address capable of identifying any stored character or octet within the memory must be greater than one cell in width.

A number of practical systems have used cell width octet addresses with varying degrees of success. For example, a number of the 16-bit minicomputers have been restricted architecturally to 15 bit cell addressing; in fact, in some cases, the 16th bit has been used to mark indirect addresses. On such systems, it has been possible to address all of memory with a 16 bit octet address, with no negative side effects.
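
The synthetic address calculation described above can be sketched in a few lines. This is only an illustration: the names OCTETS/CELL, CELL>OCTET and OCTET>CELL are hypothetical, two octets per cell is an assumption, and the "addresses" are plain numbers standing in for the cell addresses of a 16 bit cell addressed machine.

```forth
\ Sketch only. OCTETS/CELL, CELL>OCTET and OCTET>CELL are hypothetical
\ names; the numbers stand in for cell addresses on an assumed
\ 16 bit cell addressed machine.
2 CONSTANT OCTETS/CELL      \ assumed: two octets per 16 bit cell

\ Synthetic octet address = cell address * octets per cell + position.
: CELL>OCTET ( cell-addr pos -- octet-addr )
   SWAP OCTETS/CELL * + ;

\ Recover the physical cell address and the position within the cell.
: OCTET>CELL ( octet-addr -- cell-addr pos )
   OCTETS/CELL /MOD SWAP ;

100 1 CELL>OCTET      \ leaves 201
OCTET>CELL            \ leaves 100 1
2DROP
```

With cell addresses limited to 15 bits, the largest synthetic address is 32767 * 2 + 1 = 65535, which still fits in a 16 bit cell. With full 16 bit cell addresses the product needs a 17th bit, which is the dynamic range problem noted above.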

Less successful have been efforts to use 16 bit synthetic octet addresses on machines that support full 16 bit cell addressing. One strategy is to limit octet addressing to the low half of memory. Another is to "float" octet addressing upon each task's private memory. Yet another subdivides octet addressable space into a static, common region and another which is "floated". Each of these strategies has inflicted pain upon programmers who have had to live with them.

A slightly less obvious form of this pain has been experienced when maintaining a single source base that runs on both cell and octet addressed machines. In a typical synthetic addressing scheme for such 16 bit machines, it is possible to convert a cell address into the synthetic address of its first octet by simply doubling the cell address. The advantage of this transformation was that all the system had to do was specify which operators took octet addresses as opposed to cell addresses, and expect the programmer to use the conversion operator when needed. This avoided the need for special allocation and declaration functions for octet space.

The disadvantage was that, when running on an octet addressed machine, the conversion operators were no-ops. The consequences of failing to use a conversion operator, or of using the wrong address type with a given function, were nil. As a result, a programmer could change such a program inattentively, test it on an octet addressed machine, and never discover the bugs thus introduced until the program was later run on a cell addressed machine.

Practical experience has shown that this error is easy to make, hard to detect, and is a direct consequence of having an octet address that is of the same size and the same value as is the regular memory address on octet addressed machines. As a result, it appears that from the perspective of human factors this is an architecture to be avoided.
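
A minimal sketch of how this error hides, assuming an octet addressed host with eight bit characters. CELL>BYTE is a hypothetical name for the conversion operator, and BUF, FIRST-OK and FIRST-BUG exist only for the demonstration.

```forth
\ Sketch only. CELL>BYTE is a hypothetical conversion operator, shown
\ as it might be defined on an octet addressed host.
: CELL>BYTE ( a-addr -- c-addr ) ;          \ a no-op on this kind of host
\ On a 16 bit cell addressed target it would instead double the address:
\ : CELL>BYTE ( cell-addr -- octet-addr ) 2* ;

CREATE BUF 16 ALLOT
65 BUF C!                                   \ store one octet for the demo

: FIRST-OK  ( -- u ) BUF CELL>BYTE C@ ;     \ conversion applied
: FIRST-BUG ( -- u ) BUF C@ ;               \ conversion forgotten

FIRST-OK FIRST-BUG =    \ true on this host, so the omission goes unnoticed
DROP
```

Both words fetch the same octet here, so no test run on the octet addressed host can tell them apart. Only the cell addressed target, where the conversion actually changes the value, would expose FIRST-BUG.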

Based on this experience, it is proposed that explicit octet addressing be done using an ordered pair. This practice has actually been used in a number of systems, and is also the method often used in hardware and software support for octet sequences on large cell addressed mainframes.

Synopsis of proposed architecture:
----------------------------------

The ordered pair of an Octet Address consists of a Base Address and an Octet Index. The Base Address is the standard Address of the beginning of a memory allocation declared for an Octet Sequence. All Octet Addresses within that allocation share the same Base Address, and there is no portable method for transforming an Octet Address with a given Base Address to use a different Base Address. The Octet Index is a zero relative non-negative integer denoting the position of an octet within the sequence which starts at the Base Address. On the stack, the Base Address is on top.

Arithmetic on Octet Addresses is meaningful only when subtracting the address of one octet from that of another within the same sequence, or when adding or subtracting a scalar to or from the address of an octet. This structure and these rules allow the application to use double operators such as M+ and D- for the valid arithmetic if those operators are assumed present; otherwise, since such valid arithmetic never involves carries or borrows between the Index and Base parts of the Octet Address, it is amenable to simple arithmetic operations using standard CORE operators, and similarly for machine code. For example, the difference between two Octet Addresses that may be validly compared may be computed

    ROT 2DROP -      ( in lieu of D- )

and an Octet Address may be decremented using

    SWAP 1- SWAP     ( in lieu of -1 M+ )

Incrementation is of course done by the dedicated operator below.

Finally, this arrangement leads to syntax which is analogous to that which is commonly used with arrays in Forth.
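
The CORE-only arithmetic above can be wrapped in words for clarity. This is a sketch under stated assumptions: OCTET-DIFF, OCTET-1 and OCTET+N are hypothetical names that are not part of the proposal, and 1000 merely stands in for a Base Address.

```forth
\ Sketch only. OCTET-DIFF, OCTET-1 and OCTET+N are hypothetical names.
\ An Octet Address is the pair ( index base ), Base Address on top.

\ Difference of two Octet Addresses within the same sequence.
: OCTET-DIFF ( index1 base index2 base -- n )
   ROT 2DROP - ;

\ Step an Octet Address back by one octet.
: OCTET-1 ( index base -- index-1 base )
   SWAP 1- SWAP ;

\ Add a scalar to an Octet Address.
: OCTET+N ( index base n -- index+n base )
   ROT + SWAP ;

\ 1000 stands in for a Base Address.
7 1000  3 1000  OCTET-DIFF     \ leaves 4
DROP
7 1000  OCTET-1                \ leaves 6 1000
5 OCTET+N                      \ leaves 11 1000
2DROP
```

Because only the Index cell ever changes, none of these words needs the DOUBLE extension, and the Base Address passes through untouched.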

If PACKET has been declared as an octet sequence, the phrase:

    5 PACKET

places on the stack the formal Octet Address of the sixth octet in that sequence, since PACKET simply provides the Base Address for that sequence. In a loop,

    I PACKET    or    4 + DUP PACKET

occurs naturally as it does with arrays, helping out with the stack bloat that would occur if "indexing" were not available and arithmetic on the double form was the only way to navigate.

I believe, based on considerable experience, that this is the cleanest way to deal with this issue. In fact, it is precisely the solution that ATHENA uses for data structures defined as sequences of *bits*, where it has served well, led to readable code, and produced no glaring inconsistencies.

Based on this, the minimum set of things we might need is:

    OCTETS  ( n1 - n2)              Clone defn from CHARS
    OCTET+  ( 8-addr1 - 8-addr2)    Clone defn from CHAR+
    8@      ( 8-addr - u)           Clone defn from C@
    8!      ( u 8-addr)             Clone defn from C!
    8MOVE   ( 8-addr1 8-addr2 u)    Clone defn from CMOVE

It is strictly coincidental that "8" looks very much like "B" at first glance ;-)

Storage for octet sequences is allocated using the present conventions for allocating and identifying *aligned* addresses. For example,

    CREATE PACKET  536 OCTETS ALLOT

    ... , ... ALIGN HERE  64 OCTETS ALLOT ...

For the purpose of complying with standards, the first form is more likely to be used. The requirement for ALIGNing Base Addresses facilitates efficient implementations on the universe of equipment.

Addition of octet sequence support to the FILE extension must be done in such a way that it is independent of character size, which may be larger than an octet. However, as written, all FILE operators function in terms of lengths and positions whose units are characters. Because more than one octet position may map onto the same character position, dealing with the same file ID in terms of both octets and characters would be problematic.
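
To make the position mapping problem concrete, here is a small sketch assuming 16 bit characters, that is, two octets per character. OCTETS/CHAR and OCTET-POS>CHAR-POS are hypothetical names, and single-cell numbers are used for brevity even though real file positions are double-cell.

```forth
\ Sketch only. OCTETS/CHAR and OCTET-POS>CHAR-POS are hypothetical names,
\ and 16 bit characters (two octets each) are assumed for illustration.
\ Real file positions are double-cell numbers; single cells keep this short.
2 CONSTANT OCTETS/CHAR

\ Character position that contains a given octet position.
: OCTET-POS>CHAR-POS ( u1 -- u2 )
   OCTETS/CHAR / ;

6 OCTET-POS>CHAR-POS     \ leaves 3
7 OCTET-POS>CHAR-POS     \ leaves 3 as well
2DROP
```

Octet positions 6 and 7 both land on character position 3, so a character-unit position cannot say which of the two octets is meant. That is why mixing the two units on one file ID is awkward.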

Instead, the following is proposed:

    OCT  ( fam1 - fam2)
        Modify the implementation-defined file access method fam1 to additionally select an octet oriented, as opposed to character or file oriented, access method.

When a file ID has been opened with the OCT access method, all file positions and sizes used in association with that file are in units of octets instead of characters. In addition, it is an ambiguous condition to use READ-FILE, READ-LINE, WRITE-FILE, WRITE-LINE, or INCLUDE-FILE with such a file ID. INCLUDED is not mentioned in this list because it does not consume a file ID.

    READ-OCTET  ( 8-addr u1 fileid - u2 ior)
        Clone from READ-FILE
        Note ambiguous condition if used with a fileid not opened as OCT

    WRITE-OCTET  ( 8-addr u fileid - ior)
        Clone from WRITE-FILE
        Note ambiguous condition if used with a fileid not opened as OCT

This appears to be the minimum necessary change. READ-FILE and WRITE-FILE are not overloaded because experience indicates that having different arguments for the same function depending on a flag leads to maintenance problems.

If written, this proposal will of course have to include a number of details in sections 2, 3, and 4 as well as 11 and whatever is assigned for this extension.

ALTERNATIVE STRUCTURE 1:
------------------------

If the TC strongly feels that this is too much solution for the problem, there is a simpler alternative that is logically self consistent:

1. An octet is guaranteed to fit inside the storage allocation for a character.

2. Therefore, omit all of this except the FILE wordset part.

3. In the FILE wordset, include OCT but simply note that in this access method octets are read from and written to the device, sizes and positions are in octets, and the data are read into and written from character storage such that octets are right justified and zero filled into characters on READ-FILE, only the low order eight bits of each character are written by WRITE-FILE, and that READ-LINE, WRITE-LINE, and INCLUDE-FILE are ambiguous with an OCT file handle.

The disadvantage of this is that while it would allow everyone with the AU=byte=char=octet dependency to congratulate themselves as having complied without doing any work, it would not address the physical storage structures commonly used by hardware and operating systems for cell addressed equipment, and would be inefficient on byte addressed machines with large characters.

ALTERNATIVE STRUCTURE 2:
------------------------

It might be more useful to use the initial structure above but to de-ambiguify READ-FILE and WRITE-FILE by incorporating the conventions in item 3. of alternative 1 above. What this would buy is that an existing AU=byte=octet=char application that had to be converted in a hurry to use, say, 16 bit characters could adapt to such a system by using OCT as file access method with no other changes (assuming it was coded with CHARS and CHAR+ as needed) and still operate upon its octet sequence structures with reduced efficiency. For that matter, it could run on cell addressed hardware with similarly reduced efficiency. In either case, at leisure and if necessary the application could be upgraded to actually use the Octet Addressing functions, but in the meanwhile there would be a fast and dirty way to solve the problem with minimal effort.

At present I think that Alternative 2 would be the wisest of these three.

Perhaps the part of Alternative 2 taken from Alternative 1 could be the OCTET extension, and the rest of it could be called OCTET EXT.
Or, if one felt more strongly about it, OCT could be added to the base FILE wordset along with the change in behavior of that wordset per Alternative 1, and the rest of 2 implemented as simply the OCTET wordset with no OCTET EXT (as yet). For those maintaining typical systems, that could require as little as adding OCT as a no-op.

Obviously it would be nice to have a first draft that might pass, so these packaging issues should be more or less resolved first. In that regard the central question is, to me, how essential and therefore how non-optional each of these layers should be.

-----------------------------------------------------------------
Greg Bailey      | ATHENA Programming, Inc | 503-295-7703 |
---------------- | 310 SW 4th Ave Ste 530  | fax 295-6935 |
greg@minerva.com | Portland, OR 97204 US   |