Better Encoding and Newline Support In The Diff Algorithms [NOTE: This is work-in-progress.] Introduction ============ Currently, the diff handling routines in libsvn_diff know nothing about character encodings and eol characters. It assumes an ASCII-based encoding and LF as line separator. This leads to a lot of problems: * Diff output will be inconsistently encoded. * Files with different line endings cause unexpected results (i.e. CR line endings). * Diff output gets inconsistent line endings. * Non-ASCII based encodings, such as UTF16 aren't supported at all by subversion out-of-the-box. Solving this situation seems to be a lot of work. The motivation for starting this was issue #1533 'diff output doesn't use correct encoding'. This issue is solved, making the diff code assume the locale encoding for file contents rather than UTF8, but the problems discussed in this file are still present. Header Encoding =============== Currently, the headers are written using the locale encoding, which is not always what's wanted. If the encoding of the files is known (via svn:mime-type, for example), the headers should probably be written using that encoding. Note that this applies to property change information and property values in the svn: namespace as well. For other properties, we can't do anything but treat them as opaque. Newlines ======== According to the GNU diff documentation, on systems with newline separators other than just LF, the newlines are normalized to the system markers, except when --binary is used. Currently, our diff library understands nothing but LF as newline. Making it accept CRLF and CR as well is not hard. Since we know the newline marker used in the file via the svn:eol-style property, we can handle this quite well. If svn:eol-style is not set, I suggest we output newlines as-is, and use APR_EOL_STR to output newlines in headers. That's consistent with how GNU diff behaves with the --binary option. When svn:eol-style is set, we should use that style for the headers. The values might be different for the original and the new file; it seems logical to use the value from the modified file. Note that in this case, newlines will be inconsistent anyway. Also, the libsvn_client should make sure the files are translated into their newline style before comparing them (this is necessary since working files don't have their newlines normalized if svn:eol-style is changed in the working revision). In the usual case, when svn:eol-style is not changed, this will give consistent newlines for the whole diff. If svn:eol-style is changed, the diff will contain every line in the file with eol marker changes. This is what happens currently if you do a repos_to_repos diff with svn:eol-style changed. If svn:eol-style is set to native, then APR_EOL_STR should be used, as usual. This requires that the svn_client_diff* functions read the svn:eol-style property of the modified file and pass that information to svn_diff_file_output_unified. svn_diff_file_output_unified needs an eolstr argument, giving the newline marker to use for headers. Content Encoding ================ To support encodings that aren't ASCII-based (meaning that the first 128 bytes always means the same as in ASCII), Subversion needs to know the encodings of the files being diffed. We don't currently have a canonical way of detecting the encoding. It has been suggested to use the charset parameter of svn:mime-type for this purpose. Whatever method we choose, we need to cope with the fact that not all files have this information available. In this case, we might assume the locale/console encoding. When the encodings of the files are known, the diff tokenizer should use that to decide what newline separator it expects. A simple solution is to just recode "\n", "\r\n" and "\r" into the file encodings and search for that. Beware that to support UTF16 and other forms of Unicode, we need to support null bytes in these strings. NOTE: Supporting non-byte-oriented encodings such as UTF16 will require work in other parts of the client libraries as well. I'm discussing it here to not design a solution where we can't support that in the future. To support this, svn_diff_file_diff will need arguments for the encodings of the original and modified files. Merge ===== Merging (i.e. diff3) can be handled in similar ways to diff. The eol-style of the .mine file should be used for the conflict markers and the files should be translated to their newline styles if needed. The encoding part is a bit trickier. If the encoding of all the three files is the same, then conflict markers should use that encoding as well. NOTE: For UTF16 and UTF32, the BOM might be problematic. Ideally, we need to be careful to not add extra BOMs inside the file. One idea is to strip the BOMs before merging and ensure that the resulting file has a BOM after the merge. I'm not sure how much encoding specific code we want to add to our diff library. Maybe UTF16 would be considered common enough to not handle it like "just another encoding". For UTF8, we may need to handle the BOM as well, since that's allowed. We need to be careful not to add BOMs that aren't in the files, since that will break applications (and we don't want to silently change the contents of users' files!) If the encodings are different for the three files, merging could easily lead to an inconsistent mess, unless the encodings share some subset (like when changing from US-ASCII to UTF-8). I think we should leave those rare cases to the user, who can recode and merge by hand or use some other tool.