Discussion:
[freetds] iconv madness
Frediano Ziglio
2014-02-15 17:32:53 UTC
Hi,
I'm probably going to refactor the iconv stuff a bit.

1- Remove the double conversion. Some iconv implementations do not
allow conversion freely from any charset to any other; they mainly
allow conversions from/to one specific charset (i.e. UCS-4 or UCS-2).
Those environments will have to use GNU libiconv. The code that
handles them is quite ugly, no longer tested, and was introduced only
for HP-UX. It would be good to have a test program, or better a test
in configure, something like the sketch below. I'll see what I can do.
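A minimal configure-style probe could look like the following; it
assumes only a POSIX iconv, and treats failure of a plain
8-bit-to-8-bit conversion as a sign that the implementation pivots
through UCS and needs GNU libiconv (charset names vary per platform,
so a real test should try a few spellings):

    #include <iconv.h>
    #include <stdio.h>

    /* Probe whether iconv can convert directly between two
     * "ordinary" charsets.  Implementations that only support
     * conversions from/to UCS-2/UCS-4 fail iconv_open() here. */
    int main(void)
    {
        iconv_t cd = iconv_open("ISO8859-1", "UTF-8");
        if (cd == (iconv_t) -1) {
            fprintf(stderr, "no direct conversion; GNU libiconv needed\n");
            return 1;
        }
        iconv_close(cd);
        return 0;
    }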

2- Remove tds_iconv in favour of tds_convert_stream. There are mainly
three functions that convert a string in chunks: tds_iconv,
tds_convert_stream and tds_put_string. read_and_convert was another,
but it was converted to use tds_convert_stream, and tds_put_string
will be converted too. At that point most conversions would go through
tds_convert_stream and tds_iconv, which both convert in chunks.
Chunked conversion causes a lot of strange errors and the code is
quite complicated (see the sketch below). A single function will make
things easier and faster. The resulting function will be quite
complicated, but no more than tds_iconv is now!
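For the record, the subtlety with chunked conversion is that a
multibyte character can be split across a chunk boundary, so every
converter has to carry leftover bytes between calls. A hedged sketch
of the pattern (hypothetical names, not the actual FreeTDS code;
E2BIG handling omitted for brevity):

    #include <iconv.h>
    #include <errno.h>
    #include <string.h>

    /* Convert one chunk (inlen <= 4096), keeping any trailing
     * partial character in 'left' for the next call. */
    size_t convert_chunk(iconv_t cd, const char *in, size_t inlen,
                         char *out, size_t outlen,
                         char left[8], size_t *leftlen)
    {
        char buf[4096 + 8];
        char *ip = buf, *op = out;
        size_t ilen, olen = outlen;

        memcpy(buf, left, *leftlen);        /* prepend leftovers */
        memcpy(buf + *leftlen, in, inlen);
        ilen = *leftlen + inlen;

        /* EINVAL means "incomplete sequence at end of input":
         * stash the tail and retry it with the next chunk. */
        if (iconv(cd, &ip, &ilen, &op, &olen) == (size_t) -1
            && errno == EINVAL) {
            memcpy(left, ip, ilen);
            *leftlen = ilen;
        } else {
            *leftlen = 0;
        }
        return outlen - olen;               /* bytes produced */
    }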

3- Other uses could easily be converted to use tds_convert_stream,
probably with a simple wrapper interface; see the sketch below.
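Something like this hypothetical wrapper would let plain in-memory
buffers go through the same stream path (names are illustrative, not
the real API):

    #include <stddef.h>
    #include <string.h>

    /* Present a memory buffer as an input stream with a read()
     * callback, so one conversion routine serves both network
     * reads and in-memory strings. */
    typedef struct static_in {
        int (*read)(struct static_in *s, void *ptr, size_t len);
        const char *buf;
        size_t left;
    } static_in;

    static int static_read(static_in *s, void *ptr, size_t len)
    {
        if (len > s->left)
            len = s->left;
        memcpy(ptr, s->buf, len);
        s->buf += len;
        s->left -= len;
        return (int) len;
    }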

4- tds_get_char_data and its callers... well... a strange mix that
could probably be simplified by making allocation more dynamic.
tds_get_char_data already handles binary data and BLOBs; it's not
clear who handles column data.

Yes, the code really is quite complicated. Some refactoring is
needed. Row handling is ugly too.

Frediano
James K. Lowden
2014-02-16 20:41:34 UTC
On Sat, 15 Feb 2014 17:32:53 +0000
Post by Frediano Ziglio
I'm probably going to refactor the iconv stuff a bit.
Hi Freddy,

Since the iconv mess is mostly my fault, I have a few suggestions in
hindsight.

There are only two character-set conversion requirements:

1. On the wire, to convert TDS 7+ SQL text to/from UCS-2.

2. In the client, to convert character data to/from the client's
character set for bound buffers (whether dbbind, dbrpcbind or
bcp_bind).
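For requirement 1 the conversion itself is plain iconv; a minimal
sketch, assuming the platform recognizes the "UCS-2LE" name that
matches the TDS 7+ wire format:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    /* Convert a UTF-8 SQL string to UCS-2LE for the wire.
     * Error handling trimmed to the essentials. */
    int main(void)
    {
        iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
        char in[] = "select 1", out[64];
        char *ip = in, *op = out;
        size_t ilen = strlen(in), olen = sizeof(out);

        if (cd == (iconv_t) -1
            || iconv(cd, &ip, &ilen, &op, &olen) == (size_t) -1) {
            perror("iconv");
            return 1;
        }
        printf("%zu bytes on the wire\n", sizeof(out) - olen);
        iconv_close(cd);
        return 0;
    }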

The policy we adopted was to convert all character data "at the wire".
The read/write functions performed the conversion, and all character
data were kept in the client's preferred encoding.

That policy was a mistake. Better not to touch the data unless and
until it's necessary, because that simplifies some things that should
be simple, like binary bindings. Only when the encoding must be
changed -- for the protocol, or for the client -- should the conversion
take place.

It's too bad the iconv standard API doesn't include I/O functions,
because for our purposes functions modelled on printf/scanf would be
easier to use. I suppose that's what tds_convert_stream is for.
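Purely hypothetical, but the wished-for helper might look like a
"converting fputs" that owns the loop and the buffer, so callers
never touch iconv directly:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    /* Convert 's' with 'cd' and write the result to 'f',
     * looping while the output buffer fills (E2BIG).  Returns
     * EOF if no progress can be made (invalid input). */
    int conv_fputs(const char *s, iconv_t cd, FILE *f)
    {
        char *ip = (char *) s, out[256];
        size_t ilen = strlen(s);

        while (ilen > 0) {
            char *op = out;
            size_t olen = sizeof(out);
            if (iconv(cd, &ip, &ilen, &op, &olen) == (size_t) -1
                && olen == sizeof(out))
                return EOF;
            fwrite(out, 1, sizeof(out) - olen, f);
        }
        return 0;
    }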

Apart from protocol requirements, character-set conversion is entirely
a client library function. The user expects characters to be
faithfully rendered on the screen, no matter how they're stored in
database or transmitted on the wire. In effect, all
*client-provided* character buffers implicitly use the client's
encoding. That means dbdata() returns the unconverted buffer and
dbbind() enlists conversion (if need be).
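From the application side the contract looks like this (dbbind() and
dbdata() are the real dblib entry points; the encoding behaviour in
the comments is the policy proposed above, not necessarily what the
code does today):

    #include <stdio.h>
    #include <sybfront.h>
    #include <sybdb.h>

    static void show_rows(DBPROCESS *dbproc)
    {
        DBCHAR name[41];

        /* Bound buffer: conversion to the client's encoding
         * happens (if needed) as part of dbnextrow(). */
        dbbind(dbproc, 1, NTBSTRINGBIND, sizeof(name), (BYTE *) name);

        while (dbnextrow(dbproc) != NO_MORE_ROWS) {
            /* Unbound access: dbdata() returns the internal
             * buffer, unconverted under the proposed policy. */
            BYTE *raw = dbdata(dbproc, 1);
            printf("%s (%d raw bytes)\n", name,
                   (int) dbdatlen(dbproc, 1));
            (void) raw;
        }
    }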

There is an exception: bcp. Sybase's bcp utility supports a -j option
that defines the file's encoding, which may be different from the
encoding in the user's environment. The user may well have a file --
or require a file -- in a "foreign" encoding, say a UCS-2 file
destined for a Windows machine but created on a utf-8 Linux box.
There is no reason freebcp can't support that, but the current design
makes it more difficult than it should be.

It would also be nice if, as a byproduct of your work, freebcp could
read from a pipe. I created the misbegotten requirement that the
stream be seekable to avoid memory pressure. With today's larger
memories, freebcp should just realloc as needed to hold a column
buffer.

HTH & regards,

--jkl
Frediano Ziglio
2014-02-17 18:52:46 UTC
Post by James K. Lowden
On Sat, 15 Feb 2014 17:32:53 +0000
Post by Frediano Ziglio
I'm probably going to refactor the iconv stuff a bit.
Hi Freddy,
Since the iconv mess is mostly my fault, I have a few suggestions in
hindsight.
1. On the wire, to convert TDS 7+ SQL text to/from UCS-2.
2. In the client, to convert character data to/from the client's
character set for bound buffers (whether dbbind, dbrpcbind or
bcp_bind).
The policy we adopted was to convert all character data "at the wire".
The read/write functions performed the conversion, and all character
data were kept in the client's preferred encoding.
That policy was a mistake. Better not to touch the data unless and
until it's necessary, because that simplifies some things that should
be simple, like binary bindings. Only when the encoding must be
changed -- for the protocol, or for the client -- should the conversion
take place.
I don't blame you at all. At the time we chose this approach, dblib
was the main library, and it does not support wide characters at all.
Also, the Unix world did not like wide characters much. The result was
to convert everything to a single-/multi-byte string. Honestly, even
now I think people use dblib with PHP and other tools where having
UTF-8 strings is much better than having a mix of encodings.
Considering that every column in MSSQL has its own encoding, and that
dblib does not provide a way to query which encoding was used...
well... hard to switch!

I agree, however, that the conversion should be moved to the upper
layer (well... ODBC already does this!).
Post by James K. Lowden
It's too bad the iconv standard API doesn't include I/O functions,
because for our purposes functions modelled on printf/scanf would be
easier to use. I suppose that's what tds_convert_stream is for.
See http://freetds.sourceforge.net/up/doxy/a00255.html for details
(uploaded a few minutes ago; documentation updated this weekend!).
Mainly, instead of using character buffers, it uses structures with
callbacks. This way you can convert as much data as you want! It will
also solve possible problems with iconv conversion states (some ugly
pieces of code in tds_iconv). The rough shape is sketched below.
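Roughly, it is a pair of stream structures with I/O callbacks and a
single conversion routine driving them; the names below illustrate
the idea and are not necessarily the exact API on that page:

    #include <iconv.h>
    #include <stddef.h>

    /* Input and output as structs with callbacks: the converter
     * pulls from one, pushes to the other, and keeps the iconv
     * conversion state in exactly one place. */
    typedef struct in_stream {
        /* Return bytes read, 0 on end of data, <0 on error. */
        int (*read)(struct in_stream *s, void *ptr, size_t len);
    } in_stream;

    typedef struct out_stream {
        /* Consume 'len' bytes from 'buffer'; return bytes taken. */
        int (*write)(struct out_stream *s, size_t len);
        void  *buffer;
        size_t buf_len;
    } out_stream;

    /* Everything that used to call tds_iconv in chunks goes
     * through one loop behind this entry point instead. */
    int convert_stream(iconv_t cd, in_stream *in, out_stream *out);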
Post by James K. Lowden
Apart from protocol requirements, character-set conversion is entirely
a client library function. The user expects characters to be
faithfully rendered on the screen, no matter how they're stored in
database or transmitted on the wire. In effect, all
*client-provided* character buffers implicitly use the client's
encoding. That means dbdata() returns the unconverted buffer and
dbbind() enlists conversion (if need be).
There is an exception: bcp. Sybase's bcp utility supports a -j option
that defines the file's encoding, which may be different from the
encoding in the user's environment. The user may well have a file --
or require a file -- in a "foreign" encoding, say a UCS-2 file
destined for a Windows machine but created on a utf-8 Linux box.
There is no reason freebcp can't support that, but the current design
makes it more difficult than it should be.
Yes: bcp, rows, type handling, all a big mess. The idea that libTDS
creates and allocates one structure while some fields are not in the
row is... well... not that good. Perhaps allocating every column on
its own would be simpler. I also don't quite understand the
relationship between type and binary representation. Every type has a
representation, but blobs share the initial pointer... Not to mention
that varchars are normally allocated as statically sized row buffers
even though they can hold up to 8000 characters, which can be a big
waste of memory. Mainly, the row idea is good for small data (ints,
floats, dates and so on), but for potentially large buffers it can be
a problem. Well... it's surely true that the data is allocated once
and you have fewer out-of-memory checks, but perhaps that mattered
years ago! Perhaps having column_data point directly at the data
would even be easier for the upper layers, instead of always having
to check is_blob_col to get the pointer to the data (see the sketch
below)!
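The branch in question looks roughly like this (stub types stand in
for the real TDSCOLUMN/TDSBLOB; field names quoted from memory):

    typedef struct { char *textvalue; /* ... */ } TDSBLOB;
    typedef struct { unsigned char *column_data; int is_blob; } TDSCOLUMN;
    #define is_blob_col(c) ((c)->is_blob)  /* stand-in for the real test */

    /* Blob columns indirect through a struct whose first member
     * points at the data; plain columns don't, so every consumer
     * repeats this check.  Making column_data always point at the
     * bytes would remove it. */
    unsigned char *column_bytes(TDSCOLUMN *col)
    {
        if (is_blob_col(col))
            return (unsigned char *) ((TDSBLOB *) col->column_data)->textvalue;
        return col->column_data;
    }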

Along similar lines, I'm moving to our DSTR implementation even for
column names (already changed for tables and less important fields).
Yes, more checks to do and more memory fragmentation, but more
manageable code. Knowing that a string is always a DSTR, and not
sometimes a char *, sometimes a DSTR and sometimes a char[], makes
thinking easier!
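For reference, the DSTR usage pattern is roughly this (the tds_dstr_*
helpers do exist in the tree; I'm quoting their shapes from memory):

    #include <stdio.h>
    /* DSTR and the tds_dstr_* helpers come from the libTDS headers. */

    static void demo(void)
    {
        DSTR name;

        tds_dstr_init(&name);
        if (!tds_dstr_copy(&name, "column_name"))
            return;                      /* allocation failed */
        printf("%s (%u bytes)\n", tds_dstr_cstr(&name),
               (unsigned) tds_dstr_len(&name));
        tds_dstr_free(&name);
    }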
Post by James K. Lowden
It would also be nice if, as a byproduct of your work, freebcp could
read from a pipe. I created the misbegotten requirement that the
stream be seekable to avoid memory pressure. With today's larger
memories, freebcp should just realloc as needed to hold a column
buffer.
Well... this was implemented months ago (fseek is used only to copy
data to the error file) :) I even optimized the terminator recognition
and I'll improve it further. The final stage would be to support
varchar/varbinary(max), making it possible to insert data of unknown
length (for instance, converting strings on the fly!) onto the wire
using a few KB of memory. varchar/varbinary(max) has the concept of
streamable data, which makes it possible not to specify the length up
front (sketched below).
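On the wire this is the PLP format from the TDS 7.2+ spec, as I
understand it: an 8-byte "unknown length" marker, then
length-prefixed chunks, then a zero-length terminator. A hedged
sketch of the writer side (the put_* helpers are hypothetical
stand-ins for the real output routines):

    #include <stdint.h>

    /* Hypothetical low-level writers, assumed to exist elsewhere. */
    void put_uint64_le(void *tds, uint64_t v);
    void put_uint32_le(void *tds, uint32_t v);
    void put_bytes(void *tds, const void *p, uint32_t len);

    #define PLP_UNKNOWN_LEN UINT64_C(0xFFFFFFFFFFFFFFFE)

    /* Stream a varchar/varbinary(max) value of unknown length:
     * header, then 4-byte-length-prefixed chunks, then a zero
     * chunk as terminator.  Only one chunk is in memory at once. */
    static void put_plp_stream(void *tds,
                               int (*next_chunk)(void *ctx, const void **p,
                                                 uint32_t *len),
                               void *ctx)
    {
        const void *p;
        uint32_t len;

        put_uint64_le(tds, PLP_UNKNOWN_LEN);
        while (next_chunk(ctx, &p, &len) && len > 0) {
            put_uint32_le(tds, len);
            put_bytes(tds, p, len);
        }
        put_uint32_le(tds, 0);
    }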
Post by James K. Lowden
HTH & regards,
--jkl
Frediano
