Charset of a file (ASCII, UTF-8) and LotusScript

Dear All,

Is it possible to determine the charset of a file (ASCII, UTF-8, etc.) using LotusScript?

I use the following LotusScript code:

Open file For Input As fileNum Charset = "UTF-8"

The problem is that sometimes I have an ASCII file, and I would like to change the charset to ASCII in that case. How can I find out the charset of a file using LotusScript?

Thanks in advance,

Mikaël Donini, Arkadin France.

Subject: Re: Knowing the character set of an input file

If the file has a byte order mark, you can read it by opening the file first as a binary NotesStream and reading off the first few bytes.
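
A minimal sketch of that first check, assuming only the UTF-8 byte order mark (EF BB BF) is of interest. HasUTF8BOM is a hypothetical helper name, not a built-in; the NotesStream members used are Open, Bytes, Read and Close.

```lotusscript
' Hypothetical helper: True if the file starts with the UTF-8 BOM (EF BB BF).
Function HasUTF8BOM(filePath As String) As Boolean
  Dim session As New NotesSession
  Dim stream As NotesStream
  Dim bytes As Variant

  HasUTF8BOM = False
  Set stream = session.CreateStream()
  ' Open the stream as "binary" so that Read hands back the raw bytes
  If Not stream.Open(filePath, "binary") Then Exit Function

  If stream.Bytes >= 3 Then
    bytes = stream.Read(3)   ' Variant holding a 0-based array of Byte
    ' Decimal 239, 187, 191 = hex EF, BB, BF
    If bytes(0) = 239 And bytes(1) = 187 And bytes(2) = 191 Then
      HasUTF8BOM = True
    End If
  End If
  Call stream.Close()
End Function
```

A file without a BOM can still be UTF-8, so a False result only means the marker is absent.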

If it’s XML, open it with a NotesStream and use the XML classes; they’ll figure out what they need to.

If it’s just a text file and the character set isn’t indicated in any way – that’s harder. How would you tell what the character set of a file is, from looking at the hex dump, if the choices are wide open? If the characters in the file are valid characters in more than one character set, you would have to do it by deciding whether the data makes sense in the context – e.g. you know if you see the character stream "whateverラ " that having a katakana character right after an English word is unlikely and you should try a different character set. But a computer lacks the intelligence to reliably discern what “looks right”.

If you come up with an algorithm, you can open the file as binary, read the actual bytes and have the program decide, then close the file and reopen with the right character set. But that’s going to be a complex algorithm, I think.
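
The reopening step might look like the sketch below. It assumes the binary pass has already produced a yes/no answer, passed in here as isUTF8; CountLines and its parameters are hypothetical names, and the line count is only a stand-in for the real import work. For the non-UTF-8 case the Charset clause is left out, which should fall back to the platform's default character set.

```lotusscript
' Hypothetical helper: reopen the file as text with the charset chosen earlier
' and count its lines.
Function CountLines(filePath As String, isUTF8 As Boolean) As Long
  Dim fileNum As Integer
  Dim textLine As String

  fileNum = FreeFile()
  If isUTF8 Then
    Open filePath For Input As fileNum Charset = "UTF-8"
  Else
    Open filePath For Input As fileNum   ' no Charset clause
  End If

  CountLines = 0
  Do Until EOF(fileNum)
    Line Input #fileNum, textLine
    CountLines = CountLines + 1   ' process textLine here
  Loop
  Close fileNum
End Function
```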

Can’t you arrange for all the files to be in the same character set? Do they come from different sources? Or arrange to have them stored in such a way that you can tell which are which, e.g. in different folders, or with some indication in the filename, or a little header or BOM?

Subject: CSV files

Thank you very much for this valuable answer. In fact, the files in question are CSV files.

These files contain international data that is imported into Notes by my program.

They can be ANSI files (for English, French, and so on) but also UTF-8 files (for Japanese users, for instance).

The files are created by the users in Microsoft Excel. The Japanese version of Excel creates UTF-8 CSV files, whereas the English version generates ANSI files.

So my problem seems difficult to solve. I have tried to read the BOM bytes by opening the file as “binary”, but as you said: “it is difficult to have a clean algorithm”.

Thank you again Andre for your expertise and time.

Mikaël.

Subject: Not so complex :)

Feel free to use this snippet. It is copy-pasted from C and might contain syntax errors. The idea: if the file contains a byte that is not part of a valid UTF-8 sequence, treat it as Extended ASCII. Otherwise, it is UTF-8.

Note that if there are no bytes > 127, it will report UTF-8, which is safe.

Function badUTF8 …
  'text() is assumed to hold the file's bytes, read in binary mode (0-based)
  Dim text() As Byte
  Dim index As Long
  Dim count As Integer
  Dim trailers As Integer

  'The function name doubles as the return value, so it is not declared again
  index = 0
  badUTF8 = True

  Do While index <= Ubound(text)
    If text(index) < 128 Then
      '00–7F, plain 7-bit ASCII: no trailing bytes
      trailers = 0
    ElseIf text(index) >= 194 And text(index) <= 223 Then
      'C2–DF, first byte of two-byte sequence
      trailers = 1
    ElseIf text(index) >= 224 And text(index) <= 239 Then
      'E0–EF, first byte of three-byte sequence
      trailers = 2
    ElseIf text(index) >= 240 And text(index) <= 244 Then
      'F0–F4, first byte of four-byte sequence
      trailers = 3
    Else
      'Invalid UTF-8
      Exit Function
    End If

    For count = 1 To trailers
      index = index + 1
      If index > Ubound(text) Then
        'Sequence is cut off at the end of the data
        Exit Function
      End If
      If text(index) < 128 Or text(index) > 191 Then
        'Trailers are in 80–BF
        Exit Function
      End If
    Next

    index = index + 1
  Loop

  'If reached this point, no bad UTF-8 characters were encountered
  badUTF8 = False
End Function

Subject: Correction…

The last lines must be:

'If reached this point, no bad UTF-8 characters were encountered
badUTF8 = False
End Function

Subject: Thank you …

Thank you very much for your time.

I am going to test it.