
utf.c File Reference


Detailed Description

Manipulate UTF-8 CONSTANT_Utf8_info character strings.

There are three character string types in this program: null-terminated (rchar) strings ala 'C' language, UTF-8 (CONSTANT_Utf8_info) strings, and Unicode (jchar)[] strings.

Convert one or more UTF-8 (jbyte) bytes to and from Unicode (jchar) characters, plus related functions, like comparison and string length.

Why are these functions called utf_XXX() instead of utf8_XXX()? Originally they were called utf8_XXX(), but when the JDK 1.5 class file spec, section 4, was reviewed (after working with the 1.2/1.4 versions), it was discovered that certain other UTF-xx formats were also provided in the spec, even if not accurately defined. (Due to errors in the revised class file specification, the 21-bit UTF characters (6 bytes) will not be implemented until a definitive correction is located. However, in anticipation of this correction, the functions are now named utf_XXX() without respect to character bit width.) Notice, however, that the spec, section 4, defines a CONSTANT_Utf8 and a CONSTANT_Utf8_info. Therefore, these designations will remain in the code unless changed in the spec.

Control

$URL: https://svn.apache.org/path/name/utf.c $ $Id: utf.c 0 09/28/2005 dlydick $

Copyright 2005 The Apache Software Foundation or its licensors, as applicable.

Licensed under the Apache License, Version 2.0 ("the License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and limitations under the License.

Version:
$LastChangedRevision: 0 $
Date:
$LastChangedDate: 09/28/2005 $
Author:
$LastChangedBy: dlydick $ Original code contributed by Daniel Lydick on 09/28/2005.

Reference

Definition in file utf.c.

#include "arch.h"
#include <string.h>
#include "jvmcfg.h"
#include "cfmacros.h"
#include "classfile.h"
#include "nts.h"
#include "util.h"


Defines

#define MAP_INVALID_UTF8_TO_QUESTION_MARK
#define RETURN_IF_NUL_BYTE

Functions

static jbyte s1_s2_strncmp (u1 *s1, int l1, u1 *s2, int l2)
 Compare two strings of any length, and potentially neither null-terminated, that is, could be a UTF string.
static void utf_c_dummy (void)
jbyte utf_classname_strcmp (CONSTANT_Utf8_info *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 Compare a UTF string containing a formatted or unformatted class name with an unformatted UTF string from constant_pool.
static jbyte utf_common_classname_strcmp (u1 *s1, int l1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 Common generic comparison, all parameters regularized.
jvm_array_dim utf_get_utf_arraydims (CONSTANT_Utf8_info *inbfr)
 Report the number of array dimensions prefixing a Java type string.
rboolean utf_isarray (CONSTANT_Utf8_info *inbfr)
 Test whether a Java type string is an array.
jbyte utf_pcfs_strcmp (CONSTANT_Utf8_info *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 Compare contents of UTF string to contents of a UTF string from a class file structure.
jbyte utf_prchar_classname_strcmp (rchar *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 Compare a null-terminated string containing a formatted or unformatted class name with an unformatted UTF string from constant_pool.
jbyte utf_prchar_pcfs_strcmp (rchar *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 Compare contents of null-terminated string to contents of a UTF string from a class file structure.
rchar * utf_utf2prchar (CONSTANT_Utf8_info *src)
 Convert a UTF string from a (CONSTANT_Utf8_info *) into a null-terminated string by allocating heap and copying the UTF data.
rchar * utf_utf2prchar_classname (CONSTANT_Utf8_info *src)
 Convert an unformatted class name UTF string (of the type ClassName and not of type [[[LClassName) from a (CONSTANT_Utf8_info *) into a null-terminated string with Java class formatting items. Result is delivered in a heap-allocated buffer. When done with result, perform HEAP_FREE_DATA(result) to return that buffer to the heap.
jshort utf_utf2unicode (CONSTANT_Utf8_info *utf_inbfr, jchar *outbfr)
 Convert UTF8 buffer into Unicode buffer.
cp_info_dup * utf_utf2utf_unformatted_classname (cp_info_dup *inbfr)
 Strip a UTF string of any class formatting it contains and return result in a heap-allocated buffer.
rboolean utf_utf_isclassformatted (CONSTANT_Utf8_info *src)
 Verify whether a UTF string contains class formatting.
jbyte utf_utf_strcmp (CONSTANT_Utf8_info *s1, CONSTANT_Utf8_info *s2)
 Compare two UTF strings from constant_pool, s1 minus s2.

Variables

static char * utf_c_copyright = "\0" "$URL: https://svn.apache.org/path/name/utf.c $ $Id: utf.c 0 09/28/2005 dlydick $" " " "Copyright 2005 The Apache Software Foundation or its licensors, as applicable."


Define Documentation

#define MAP_INVALID_UTF8_TO_QUESTION_MARK
 

Value:

*outbfr++ = (jchar) '?'; \
                                         inbfr++
Store a Unicode ? when an invalid UTF state is found and adjust the return code.

Definition at line 79 of file utf.c.

Referenced by utf_utf2unicode().

#define RETURN_IF_NUL_BYTE
 

Value:

if (UTF8_FORBIDDEN_ZERO == *inbfr) \
                           {return(charcnvcount); }
Detect NUL character and quit when found

Definition at line 83 of file utf.c.

Referenced by utf_utf2unicode().


Function Documentation

static void utf_c_dummy (void) [static]
 

Definition at line 63 of file utf.c.

jshort utf_utf2unicode (CONSTANT_Utf8_info *utf_inbfr, jchar *outbfr)
 

Convert UTF8 buffer into Unicode buffer.

Parameters:
[in] utf_inbfr UTF string structure
[out] outbfr Buffer for resulting Unicode character string
Returns:
Two returns, one a buffer, the other a count:
*outbfr Unicode version of utf_inbfr string in outbfr

charcnvcount (Return value of function) Number of Unicode characters in outbfr. This will only be the same as length when ALL UTF characters are ASCII. It will otherwise be less than that.

SPEC AMBIGUITY: In case of invalid characters, a Unicode ? is inserted and processing continues. In this way, the result string will still be invalid, but at least it will be proper Unicode. This may prove more than is necessary, but the spec says nothing at all about this matter. Since the NUL character may not appear in UTF-8, if a buffer is terminated by a NUL in the first utf_inbfr->length bytes, termination will be assumed. If a UTF8_FORBIDDEN_xxx character is read, it is converted to a Unicode ? also.

Byte patterns recognized during conversion:
Single byte: values up through '\u007f' map directly. A zero byte looks suspiciously like ASCII NUL and ends the conversion.
Double byte, first byte: top 3 bits are '110'. Bottom 5 bits contain data bits 10-6 and are moved up to bits 10-6.
Double byte, second byte: top 2 bits are '10'. Bottom 6 bits contain data bits 5-0.
Triple byte, first byte: top 4 bits are '1110'. Bottom 4 bits contain data bits 15-12 and are moved up to bits 15-12.
Triple byte, second byte: top 2 bits are '10'. Bottom 6 bits contain data bits 11-6 and are moved up to bits 11-6.
Triple byte, third byte: top 2 bits are '10'. Bottom 6 bits contain data bits 5-0.

Definition at line 116 of file utf.c.

References CONSTANT_Utf8_info::bytes, MAP_INVALID_UTF8_TO_QUESTION_MARK, RETURN_IF_NUL_BYTE, UTF8_DOUBLE_FIRST_MASK0, UTF8_DOUBLE_FIRST_SHIFT, UTF8_DOUBLE_FIRST_VAL, UTF8_DOUBLE_SECOND_MASK0, UTF8_DOUBLE_SECOND_VAL, UTF8_SINGLE_MAX, UTF8_TRIPLE_FIRST_MASK0, UTF8_TRIPLE_FIRST_SHIFT, UTF8_TRIPLE_FIRST_VAL, UTF8_TRIPLE_SECOND_MASK0, UTF8_TRIPLE_SECOND_SHIFT, UTF8_TRIPLE_SECOND_VAL, UTF8_TRIPLE_THIRD_MASK0, and UTF8_TRIPLE_THIRD_VAL.
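
The conversion described above can be sketched roughly as follows. This is not the utf.c source: the utf8_sketch type, the function name, and the literal masks are hypothetical stand-ins for CONSTANT_Utf8_info and the UTF8_xxx macros, and the way the sketch resynchronizes after an invalid byte (emit one ? and skip one input byte) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CONSTANT_Utf8_info: a byte count plus raw bytes. */
typedef struct {
    uint16_t       length;
    const uint8_t *bytes;
} utf8_sketch;

/*
 * Decode 1-, 2-, and 3-byte sequences into 16-bit characters.
 * Returns the number of characters stored in out, which must have room
 * for at least in->length entries.
 */
static int utf8_sketch_to_unicode(const utf8_sketch *in, uint16_t *out)
{
    assert(in != NULL && out != NULL);

    const uint8_t *p   = in->bytes;
    const uint8_t *end = in->bytes + in->length;
    int count = 0;

    while (p < end) {
        uint8_t b0 = *p;

        if (b0 == 0x00) {
            return count;                       /* NUL byte: assume termination */
        }
        if (b0 <= 0x7f) {                       /* single byte, 0xxxxxxx */
            out[count++] = b0;
            p += 1;
        } else if ((b0 & 0xe0) == 0xc0 && (end - p) >= 2 &&
                   (p[1] & 0xc0) == 0x80) {     /* 110xxxxx 10xxxxxx */
            out[count++] = (uint16_t)(((b0 & 0x1f) << 6) | (p[1] & 0x3f));
            p += 2;
        } else if ((b0 & 0xf0) == 0xe0 && (end - p) >= 3 &&
                   (p[1] & 0xc0) == 0x80 &&
                   (p[2] & 0xc0) == 0x80) {     /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[count++] = (uint16_t)(((b0 & 0x0f) << 12) |
                                      ((p[1] & 0x3f) << 6) |
                                      (p[2] & 0x3f));
            p += 3;
        } else {                                /* invalid: store '?' and skip */
            out[count++] = (uint16_t)'?';
            p += 1;
        }
    }
    return count;
}
```

Because every input byte yields at most one output character, an output buffer of length entries is always large enough, which matches the note that the returned count equals length only for all-ASCII input.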

rchar* utf_utf2prchar (CONSTANT_Utf8_info *src)
 

Convert a UTF string from a (CONSTANT_Utf8_info *) into a null-terminated string by allocating heap and copying the UTF data.

When done with result, perform HEAP_FREE_DATA(result).

Parameters:
src Pointer to UTF string, most likely from constant pool
Returns:
Null-terminated string in heap or rnull if heap alloc error.

Definition at line 259 of file utf.c.

Referenced by class_load_primative(), opcode_run(), and utf_isarray().
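
A minimal sketch of this allocate-and-copy pattern is shown below. It assumes a hypothetical utf8_sketch type in place of CONSTANT_Utf8_info and uses the standard malloc() rather than the project's heap macros, so the caller releases the result with free() instead of HEAP_FREE_DATA().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CONSTANT_Utf8_info: a byte count plus raw bytes. */
typedef struct {
    uint16_t       length;
    const uint8_t *bytes;
} utf8_sketch;

/*
 * Copy a counted (not NUL-terminated) string into a fresh heap buffer and
 * append the terminating NUL. Returns NULL if the allocation fails.
 */
static char *utf8_sketch_to_cstring(const utf8_sketch *src)
{
    assert(src != NULL);

    char *result = malloc((size_t)src->length + 1);
    if (result == NULL) {
        return NULL;
    }
    if (src->length > 0) {
        memcpy(result, src->bytes, src->length);
    }
    result[src->length] = '\0';
    return result;
}
```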

static jbyte s1_s2_strncmp (u1 *s1, int l1, u1 *s2, int l2) [static]
 

Compare two strings of any length, and potentially neither null-terminated, that is, could be a UTF string.

If the strings are of equal length, this function is equivalent to strcmp(3). If they are not of equal length, the first n bytes are compared as with strncmp(3), where n is the shorter length, and any non-equal result is returned. If those n bytes are equal, the comparison behaves as though it covered n+1 bytes with the shorter string ending in a \0 (NUL) character, so the longer string's character n+1 is reported, either as a positive value (s1 longer) or as a negative value (s2 longer).

This function should be used on ALL string comparisons that potentially involve lack of NUL termination, namely, anything to do with UTF strings of any sort. It is recommended also for any null-terminated string just so all string comparisons work exactly alike, no matter whether (rchar *) or UTF, whether of equal length or not.

Parameters:
s1 (rchar *) to first string
l1 Length of string s1, regardless of any null termination being present or absent in s1.
s2 (rchar *) to second string
l2 length of string s2, regardless of any null termination being present or absent in s2.
Returns:
lexicographical difference of s1 - s2. Notice that the (rchar) data is implicitly unsigned (although the actual signedness is left to the compiler), while the (jbyte) result is explicitly signed, due to the arithmetic nature of the calculation.

Definition at line 315 of file utf.c.

Referenced by utf_prchar_pcfs_strcmp(), and utf_utf_strcmp().
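
The comparison rule described above can be illustrated with the sketch below. The function name is hypothetical, and the sketch returns int rather than jbyte so that differences between bytes above 0x7f do not wrap; the real function's narrowing to a signed byte is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare two counted byte strings, neither of which needs a NUL terminator.
 * Returns s1 minus s2 at the first differing byte. When one string is a
 * prefix of the other, the shorter one is treated as ending in NUL, so the
 * longer string's next byte decides the sign of the result.
 */
static int counted_strcmp_sketch(const uint8_t *s1, int l1,
                                 const uint8_t *s2, int l2)
{
    assert(s1 != NULL && s2 != NULL && l1 >= 0 && l2 >= 0);

    int n = (l1 < l2) ? l1 : l2;            /* length of the common span */

    for (int i = 0; i < n; i++) {
        if (s1[i] != s2[i]) {
            return (int)s1[i] - (int)s2[i];
        }
    }
    if (l1 == l2) {
        return 0;                           /* same length, same bytes */
    }
    /* Common span is equal: compare byte n+1 against an implied NUL. */
    return (l1 > l2) ? (int)s1[n] : -(int)s2[n];
}
```

For example, comparing "abcd" (length 4) with "abc" (length 3) yields the value of 'd', and swapping the arguments yields its negation.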

jbyte utf_utf_strcmp (CONSTANT_Utf8_info *s1, CONSTANT_Utf8_info *s2)
 

Compare two UTF strings from constant_pool, s1 minus s2.

Parameters:
s1 First of two UTF strings to compare
s2 Second of two UTF strings to compare
Returns:
lexicographical value of first difference in strings, else 0.

Definition at line 379 of file utf.c.

References CP_THIS_STRLEN, PTR_CP_THIS_STRNAME, and s1_s2_strncmp().

jbyte utf_prchar_pcfs_strcmp (rchar *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 

Compare contents of null-terminated string to contents of a UTF string from a class file structure.

Parameters:
s1 Null-terminated string name
pcfs2 ClassFile where UTF string is found
cpidx2 Index in pcfs2 constant_pool of UTF string
Returns:
lexicographical value of first difference in strings, else 0.

Definition at line 402 of file utf.c.

References CONSTANT_Utf8_info::bytes, CP_THIS_STRLEN, CONSTANT_Utf8_info::length, PTR_CP_THIS_STRNAME, and s1_s2_strncmp().

jbyte utf_pcfs_strcmp (CONSTANT_Utf8_info *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 

Compare contents of UTF string to contents of a UTF string from a class file structure.

Parameters:
s1 UTF string name
pcfs2 ClassFile where UTF string is found
cpidx2 Index in pcfs2 constant_pool of UTF string
Returns:
lexicographical value of first difference in strings, else 0.

Definition at line 433 of file utf.c.

References BASETYPE_CHAR_L_TERM, CP_THIS_STRLEN, CONSTANT_Class_info::name_index, nts_prchar_isclassformatted(), PTR_CP_ENTRY_CLASS, PTR_CP_THIS_STRNAME, and rtrue.

Referenced by attribute_name_common_find(), field_find_by_cp_entry(), and method_find_by_cp_entry().

static jbyte utf_common_classname_strcmp (u1 *s1, int l1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2) [static]
 

Common generic comparison, all parameters regularized.

Compare a UTF or null-terminated string containing a formatted or unformatted class name with an unformatted UTF string from constant_pool. Compare s1 minus s2, but skipping, where applicable, the s1 initial BASETYPE_CHAR_L and the terminating BASETYPE_CHAR_L_TERM, plus any array dimension modifiers. The second string is specified by a constant_pool index. Notice that there are NO formatted class string names in the (CONSTANT_Class_info) entries of the constant_pool because such would be redundant. (Such entries are the formal definition of the class.)

Parameters:
s1 UTF string pointer to u1 array of characters.
l1 length of s1.
pcfs2 ClassFile structure containing second string (containing an unformatted class name)
cpidx2 constant_pool index of CONSTANT_Class_info entry whose name will be compared (by getting its name_index and the UTF string name of it)
Returns:
lexicographical value of first difference in strings, else 0.

Definition at line 479 of file utf.c.

Referenced by utf_prchar_classname_strcmp().

jbyte utf_prchar_classname_strcmp (rchar *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 

Compare a null-terminated string containing a formatted or unformatted class name with an unformatted UTF string from constant_pool.

Parameters:
s1 Null-terminated string to compare, containing formatted or unformatted class name (utf_prchar_classname_strcmp() only).
pcfs2 ClassFile structure containing second string (containing an unformatted class name)
cpidx2 constant_pool index of CONSTANT_Class_info entry whose name will be compared (by getting its name_index and the UTF string name of it)
Returns:
lexicographical value of first difference in strings, else 0.

Definition at line 537 of file utf.c.

References CONSTANT_Utf8_info::bytes, CONSTANT_Utf8_info::length, and utf_common_classname_strcmp().

Referenced by opcode_run().

jbyte utf_classname_strcmp (CONSTANT_Utf8_info *s1, ClassFile *pcfs2, jvm_constant_pool_index cpidx2)
 

Compare a UTF string containing a formatted or unformatted class name with an unformatted UTF string from constant_pool.

Parameters:
s1 UTF string to compare, containing formatted or unformatted class name.
pcfs2 ClassFile structure containing second string (containing an unformatted class name)
cpidx2 constant_pool index of CONSTANT_Class_info entry whose name will be compared (by getting its name_index and the UTF string name of it)
Returns:
lexicographical value of first difference in strings, else 0.

Definition at line 571 of file utf.c.

References BASETYPE_CHAR_ARRAY, CONSTANT_Utf8_info::bytes, and CONSTANT_MAX_ARRAY_DIMS.

jvm_array_dim utf_get_utf_arraydims (CONSTANT_Utf8_info *inbfr)
 

Report the number of array dimensions prefixing a Java type string.

No overflow condition is reported since it is assumed that inbfr is formatted with correct length. Notice that because this logic checks only for array specifiers and does not care about the rest of the string, it may be used to evaluate field descriptions, which will not contain any class formatting information.

If there is even a remote possibility that more than CONSTANT_MAX_ARRAY_DIMS dimensions will be found, compare the result of this function with the result of utf_isarray(). If there is a discrepancy, then there was an overflow here. Properly formatted class files will never contain code with this condition.

Note:
This function is identical to nts_get_arraydims() except that it works on (CONSTANT_Utf8_info *) instead of (rchar *).
Parameters:
inbfr CONSTANT_Utf8_info string.
Returns:
Number of array dimensions in string. For example, this string contains three array dimensions:
[[[Lsome/path/name/filename;

If more than CONSTANT_MAX_ARRAY_DIMS are located, the result is zero-- no other error is reported.

Definition at line 617 of file utf.c.

Referenced by class_load_primative().
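
The dimension count described above amounts to counting leading [ characters. The sketch below assumes a hypothetical function name, takes raw bytes and a length instead of a (CONSTANT_Utf8_info *), and passes the limit as a parameter in place of CONSTANT_MAX_ARRAY_DIMS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the '[' characters that prefix a counted type string.
 * Returns 0 when more than max_dims are found, mirroring the documented
 * behavior of reporting no other error on overflow.
 */
static int count_array_dims_sketch(const uint8_t *bytes, int length,
                                   int max_dims)
{
    assert(bytes != NULL && length >= 0 && max_dims >= 0);

    int dims = 0;
    while (dims < length && bytes[dims] == '[') {
        dims++;
    }
    return (dims > max_dims) ? 0 : dims;
}
```

A zero result is therefore ambiguous between "not an array" and "too many dimensions", which is why the text recommends cross-checking against utf_isarray() when overflow is possible.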

rboolean utf_isarray (CONSTANT_Utf8_info *inbfr)
 

Test whether a Java type string is an array.

Parameters:
inbfr CONSTANT_Utf8_info string.
Returns:
rtrue if this is an array specification, else rfalse.

Definition at line 660 of file utf.c.

References HEAP_GET_DATA, CONSTANT_Utf8_info::length, nts_prchar_isclassformatted(), rfalse, rnull, and utf_utf2prchar().

rchar* utf_utf2prchar_classname (CONSTANT_Utf8_info *src)
 

Convert an unformatted class name UTF string (of the type ClassName and not of type [[[LClassName) from a (CONSTANT_Utf8_info *) into a null-terminated string with Java class formatting items. Result is delivered in a heap-allocated buffer. When done with result, perform HEAP_FREE_DATA(result) to return that buffer to the heap.

This function will work on formatted class names [[[LClassName; and the difference is benign, but that is not its purpose.

Parameters:
src Pointer to UTF string, most likely from constant pool
Returns:
Null-terminated string LClassName; in heap or rnull if heap alloc error.

Definition at line 687 of file utf.c.

rboolean utf_utf_isclassformatted (CONSTANT_Utf8_info *src)
 

Verify whether a UTF string contains class formatting.

Parameters:
src Pointer to UTF string, most likely from constant pool
Returns:
rtrue if string is formatted as LClassName; but rfalse otherwise. The string may also have an array descriptor prefixed, thus [[LClassName;
Note:
This function works just like nts_prchar_isclassformatted() except that it works on (CONSTANT_Utf8_info) strings rather than on (rchar *) strings.

Definition at line 759 of file utf.c.

References BASETYPE_CHAR_L_TERM, and rtrue.
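
One way to express this check is sketched below. The function name is hypothetical, and the literals '[', 'L', and ';' are assumed to correspond to the array, class, and terminator BASETYPE_CHAR_xxx markers; how the real function treats edge cases such as an empty class name is not confirmed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Report whether a counted string looks like LClassName; with an optional
 * run of '[' array dimension markers in front of it.
 */
static bool is_class_formatted_sketch(const uint8_t *bytes, int length)
{
    assert(bytes != NULL && length >= 0);

    int i = 0;
    while (i < length && bytes[i] == '[') {
        i++;                                /* skip array dimensions */
    }
    if (length - i < 2) {
        return false;                       /* no room for both 'L' and ';' */
    }
    return bytes[i] == 'L' && bytes[length - 1] == ';';
}
```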

cp_info_dup* utf_utf2utf_unformatted_classname (cp_info_dup *inbfr)
 

Strip a UTF string of any class formatting it contains and return result in a heap-allocated buffer.

When done with this result, perform HEAP_FREE_DATA(result) to return buffer to heap.

Parameters:
inbfr Pointer to UTF string that is potentially formatted as LClassName; and which may also have array descriptor prefixed, thus [[LClassName; . This will typically be an entry from the constant_pool.
Returns:
heap-allocated buffer containing ClassName with no formatting, regardless of input formatting or lack thereof.
Note:
This function works just like nts_prchar2prchar_unformatted_classname() except that it takes a (CONSTANT_Utf8_info) string rather than a (rchar *) string and returns a (CONSTANT_Utf8_info *).

Definition at line 843 of file utf.c.
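
The stripping step can be sketched as follows, under several assumptions: the function name is hypothetical, the input is raw bytes plus a length rather than a (cp_info_dup *), the result is a NUL-terminated malloc() buffer rather than a heap-allocated constant_pool entry, and array markers are dropped even when no LClassName; wrapper follows them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a heap copy of the class name with any leading '[' markers and any
 * surrounding 'L' ... ';' removed. Input without formatting is copied as is.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
static char *strip_class_formatting_sketch(const uint8_t *bytes, int length)
{
    assert(bytes != NULL && length >= 0);

    int start = 0;
    int end   = length;

    while (start < end && bytes[start] == '[') {
        start++;                            /* drop array dimensions */
    }
    if (end - start >= 2 && bytes[start] == 'L' && bytes[end - 1] == ';') {
        start++;                            /* drop leading 'L' */
        end--;                              /* drop trailing ';' */
    }

    size_t n = (size_t)(end - start);
    char *result = malloc(n + 1);
    if (result == NULL) {
        return NULL;
    }
    if (n > 0) {
        memcpy(result, bytes + start, n);
    }
    result[n] = '\0';
    return result;
}
```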


Variable Documentation

char* utf_c_copyright = "\0" "$URL: https://svn.apache.org/path/name/utf.c $ $Id: utf.c 0 09/28/2005 dlydick $" " " "Copyright 2005 The Apache Software Foundation or its licensors, as applicable." [static]
 

Definition at line 63 of file utf.c.


Generated on Fri Sep 30 18:50:36 2005 by  doxygen 1.4.4