Davider · (This post was last modified: 07-21-2022, 02:31 PM by Davider.)

Hello everyone!

This is a question about text file encodings and code pages.

I need to process a lot of text files whose encodings are not uniform: they may be ANSI, UTF-8, UTF-16, GB2312...

The file in this example is encoded as GB2312 (code page 936).

In PowerShell I have to specify the encoding when reading the file, otherwise the text comes out garbled, so I added code to the PowerShell script that detects the text encoding.

Suppose that, after reading the file, I need to replace "测试" with "正式".

Finally, I need to save the file in its original encoding.
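
The three steps described here (detect the encoding, do the replacement, save in the original encoding) can be sketched in a few lines. This is only a minimal Python sketch, not PowerShell or QM code. The names `detect_encoding` and `replace_keep_encoding` are made up for illustration, and the GB2312 fallback is an assumption that holds only when every file without a BOM that is not valid UTF-8 really is GB2312.

```python
import codecs

# Byte order marks and the Python codec that decodes the bytes after them.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def detect_encoding(data: bytes, fallback: str = "gb2312"):
    """Return (codec_name, bom) guessed from the raw bytes of a file."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, bom
    try:
        data.decode("utf-8")   # strict decoding fails on any invalid sequence
        return "utf-8", b""
    except UnicodeDecodeError:
        return fallback, b""   # assumption: everything else is GB2312

def replace_keep_encoding(data: bytes, old: str, new: str) -> bytes:
    """Replace text and re-encode with the detected encoding and BOM."""
    name, bom = detect_encoding(data)
    text = data[len(bom):].decode(name)
    return bom + text.replace(old, new).encode(name)
```

Working on raw bytes (read the file in binary mode, write the returned bytes back) keeps both the BOM, if there was one, and the original code page.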

In QM, the text must first be converted to UTF-8, otherwise the replacement does not work.
But I don't know how to do encoding and code page programming.

Here is the PowerShell code. How can I implement similar text file encoding and code page detection in QM?

Thanks in advance for any advice and help
david

Code:

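$codes = @'
public static class GuessCoder
{
    // Guess the encoding of a file: check for a byte order mark first, then test
    // whether the bytes form valid UTF-8. Anything else is assumed to be GB2312.
    public static string Detect(string file)
    {
        byte[] data = System.IO.File.ReadAllBytes(file);
        // BOM checks: FF FE = UTF-16 LE ("Unicode" in .NET), FE FF = UTF-16 BE, EF BB BF = UTF-8
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) { return "Unicode"; }
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) { return "UTF-16BE"; }
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) { return "UTF-8"; }

        // No BOM: walk the bytes as UTF-8. charByteCounter is 1 between characters;
        // after a lead byte it holds the sequence length and counts back down to 1.
        int charByteCounter = 1;
        byte curByte;
        for (int i = 0; i < data.Length; i++)
        {
            curByte = data[i];
            if (charByteCounter == 1)
            {
                if (curByte >= 0x80)
                {
                    // Lead byte: count its leading 1 bits to get the sequence length
                    while (((curByte <<= 1) & 0x80) != 0)
                    {
                        charByteCounter++;
                    }
                    // 10xxxxxx cannot start a character, and no sequence is longer than 6 bytes
                    if (charByteCounter == 1 || charByteCounter > 6)
                    {
                        return "GB2312";
                    }
                }
            }
            else
            {
                // Continuation bytes must look like 10xxxxxx
                if ((curByte & 0xC0) != 0x80)
                {
                    return "GB2312";
                }
                charByteCounter--;
            }
        }
        // The data ended in the middle of a multibyte character
        if (charByteCounter > 1)
        {
            return "GB2312";
        }
        return "UTF-8";
    }
}
'@;
Add-Type -TypeDefinition $codes

$file_in = "$HOME\Desktop\Test.txt"
$file_ok = "$HOME\Desktop\Test_ok.txt"

# Detect the encoding name and show it
$checkenc = [GuessCoder]::Detect($file_in)
$checkenc

$enc = [Text.Encoding]::GetEncoding($checkenc)
$enc

# Read, replace, and write back with the same encoding
$text = [IO.File]::ReadAllText($file_in, $enc)
$text = $text -replace '测试','正式'
[IO.File]::WriteAllText($file_ok, $text, $enc)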

***Gintaras*** · 07-21-2022, 06:57 PM

I see that the GuessCoder class is written in C# and that the PowerShell code uses .NET. In that case it is better to use the new program. It is very similar to QM, but its script language is C#, so you would not need to learn the QM language or convert the class. Converting PowerShell to C# is also much easier than converting it to QM.

C# code:

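// script ""
var file_in = folders.Desktop + @"Test.txt";
var file_ok = folders.Desktop + @"Test_ok.txt";

var checkenc = GuessCoder.Detect(file_in);
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // makes GB2312 and other code page encodings available
var enc = Encoding.GetEncoding(checkenc);
print.it(checkenc, enc);

// Read, replace, and write back with the same encoding
var text = File.ReadAllText(file_in, enc);
text = text.Replace("测试", "正式");
File.WriteAllText(file_ok, text, enc);

public static class GuessCoder
{
    // Guess the encoding of a file: check for a byte order mark first, then test
    // whether the bytes form valid UTF-8. Anything else is assumed to be GB2312.
    public static string Detect(string file)
    {
        byte[] data = System.IO.File.ReadAllBytes(file);
        // BOM checks: FF FE = UTF-16 LE ("Unicode" in .NET), FE FF = UTF-16 BE, EF BB BF = UTF-8
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) { return "Unicode"; }
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) { return "UTF-16BE"; }
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) { return "UTF-8"; }

        // No BOM: walk the bytes as UTF-8. charByteCounter is 1 between characters;
        // after a lead byte it holds the sequence length and counts back down to 1.
        int charByteCounter = 1;
        byte curByte;
        for (int i = 0; i < data.Length; i++)
        {
            curByte = data[i];
            if (charByteCounter == 1)
            {
                if (curByte >= 0x80)
                {
                    // Lead byte: count its leading 1 bits to get the sequence length
                    while (((curByte <<= 1) & 0x80) != 0)
                    {
                        charByteCounter++;
                    }
                    // 10xxxxxx cannot start a character, and no sequence is longer than 6 bytes
                    if (charByteCounter == 1 || charByteCounter > 6)
                    {
                        return "GB2312";
                    }
                }
            }
            else
            {
                // Continuation bytes must look like 10xxxxxx
                if ((curByte & 0xC0) != 0x80)
                {
                    return "GB2312";
                }
                charByteCounter--;
            }
        }
        // The data ended in the middle of a multibyte character
        if (charByteCounter > 1)
        {
            return "GB2312";
        }
        return "UTF-8";
    }
}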

Davider · (This post was last modified: 07-21-2022, 09:38 PM by Davider.)

QM code looks more concise and easier to me; my C# programming level is not very good :)

I looked up some examples and came up with the code below.

In QM, how do I get the code page of the text encoding?

I also found some C code that detects the encoding.

Macro Macro12

Code:
_s.getfile("$desktop$\Test.txt") ;;cp gb2312
;TODO: get the code page of the text encoding
_s.ConvertEncoding(936 65001) ;;gb2312 to utf8
_s.findreplace("测试" "正式") ;;replace
_s.ConvertEncoding(65001 936) ;;UTF8 to gb2312
_s.setfile("$desktop$\Test_ok.txt")
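
One way to fill the gap marked as a to-do in the macro is a lookup from the detected encoding name to its Windows code page number. The following is a minimal Python sketch for illustration only, not QM code. `CODE_PAGES` and `code_page_for` are hypothetical names, and the keys are the encoding names returned by `GuessCoder.Detect` in the first post.

```python
# Windows code page identifiers for the encoding names used in this thread.
CODE_PAGES = {
    "unicode": 1200,    # UTF-16 little-endian (.NET calls it "Unicode")
    "utf-16be": 1201,   # UTF-16 big-endian
    "utf-8": 65001,
    "gb2312": 936,
}

def code_page_for(name: str) -> int:
    """Return the code page for an encoding name, ignoring case."""
    try:
        return CODE_PAGES[name.lower()]
    except KeyError:
        raise ValueError(f"no code page known for encoding {name!r}") from None
```

With this table, `code_page_for("GB2312")` gives 936 and `code_page_for("UTF-8")` gives 65001, the two numbers passed to `ConvertEncoding` in the macro.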

C code for code page

Code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

bool is_str_utf8(const char* str);
bool is_str_gbk(const char* str);

//Judge if it is UTF-8
bool is_str_utf8(const char* str)
{
    unsigned int nBytes = 0; //Continuation bytes still expected. Old-style UTF-8 uses 1-6 bytes per character, ASCII uses one byte
    unsigned char chr;
    for (unsigned int i = 0; str[i] != '\0'; ++i) {
        chr = (unsigned char)str[i];
        if (nBytes == 0) {
            //ASCII is encoded in 7 bits with the highest bit 0 (0xxxxxxx).
            //Any other byte should start a multibyte character; work out its length
            if (chr >= 0x80) {
                if (chr >= 0xFE) {
                    return false; //0xFE and 0xFF never appear in UTF-8
                }
                else if (chr >= 0xFC) {
                    nBytes = 6;
                }
                else if (chr >= 0xF8) {
                    nBytes = 5;
                }
                else if (chr >= 0xF0) {
                    nBytes = 4;
                }
                else if (chr >= 0xE0) {
                    nBytes = 3;
                }
                else if (chr >= 0xC0) {
                    nBytes = 2;
                }
                else {
                    return false; //10xxxxxx cannot be a first byte
                }
                nBytes--;
            }
        }
        else {
            //The non-first byte of the multibyte character should be 10xxxxxx
            if ((chr & 0xC0) != 0x80) {
                return false;
            }
            //Reduce to zero
            nBytes--;
        }
    }
    //The string ended in the middle of a character: violation of UTF-8 encoding rules
    if (nBytes != 0) {
        return false;
    }
    return true; //Text that is all ASCII is also valid UTF-8
}

//Judge if it is GBK (a superset of GB2312)
bool is_str_gbk(const char* str)
{
    unsigned int nBytes = 0; //Trail bytes still expected. GBK uses 1-2 bytes: two for Chinese, one for English
    unsigned char chr;
    for (unsigned int i = 0; str[i] != '\0'; ++i) {
        chr = (unsigned char)str[i];
        if (nBytes == 0) {
            if (chr >= 0x80) {
                //The first byte of a two-byte character is 0x81-0xFE
                if (chr >= 0x81 && chr <= 0xFE) {
                    nBytes = 2;
                }
                else {
                    return false;
                }
                nBytes--;
            }
        }
        else {
            //The second byte is 0x40-0xFE, except 0x7F
            if (chr < 0x40 || chr == 0x7F || chr > 0xFE) {
                return false;
            }
            nBytes--;
        }
    }
    if (nBytes != 0) { //The string ended after a first byte: violation of the rules
        return false;
    }
    return true; //Text that is all ASCII is also valid GBK
}

//Read the whole file and report which encodings its bytes are consistent with
void read_text(const char* file_name)
{
    FILE *file = fopen(file_name, "rb"); //binary mode: no newline translation
    if (!file)
        return;
    fseek(file, 0, SEEK_END);
    long size = ftell(file);
    rewind(file);
    if (size < 0) {
        fclose(file);
        return;
    }
    char *text = malloc((size_t)size + 1);
    if (text) {
        size_t n = fread(text, 1, (size_t)size, file);
        text[n] = '\0'; //the checks expect a NUL-terminated string
        printf("%s\n", text);
        printf("%d\n", is_str_utf8(text));
        printf("%d\n", is_str_gbk(text));
        free(text);
    }
    fclose(file);
}

//Main function testing
int main() {
    read_text("test.txt");
    return 0;
}
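
One caveat when running two independent checks like these: the GBK byte ranges are so wide that some UTF-8 text also passes the GBK range test, so the order of the checks matters and UTF-8 should be tried first. A small Python sketch of that ordering follows. The names `is_utf8`, `looks_like_gbk` and `guess` are illustrative, and `looks_like_gbk` is only a byte-range test in the spirit of `is_str_gbk`; it does not consult the real GBK character table.

```python
def is_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def looks_like_gbk(data: bytes) -> bool:
    """Range test only: ASCII, or lead byte 0x81-0xFE plus trail byte 0x40-0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
        elif (0x81 <= data[i] <= 0xFE and i + 1 < len(data)
              and 0x40 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True

def guess(data: bytes) -> str:
    """Try the stricter test (UTF-8) before the permissive one (GBK)."""
    if is_utf8(data):
        return "UTF-8"
    if looks_like_gbk(data):
        return "GBK"
    return "unknown"
```

For example, the UTF-8 bytes of "测试" satisfy `looks_like_gbk`, while the GB2312 bytes of the same text fail `is_utf8`, which is why `guess` asks about UTF-8 first.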

***Gintaras*** · 07-22-2022, 03:12 AM

Macro Macro3198

Code:
str path.expandpath("$desktop$\Test.txt")
_s.getfile(path) ;;text in its original encoding (gb2312 in this example)

;Gets the code page for the text encoding
int codePage = CsFunc("" path)
out codePage

_s.ConvertEncoding(codePage 65001) ;;detected code page to UTF-8
_s.findreplace("测试" "正式") ;;replace
_s.ConvertEncoding(65001 codePage) ;;UTF-8 back to the detected code page
_s.setfile("$desktop$\Test_ok.txt")

#ret
public static class GuessCoder
{
    // Returns the code page number of the detected encoding
    public static int DetectCP(string file) {
        return System.Text.Encoding.GetEncoding(Detect(file)).CodePage;
    }

    // Guess the encoding of a file: check for a byte order mark first, then test
    // whether the bytes form valid UTF-8. Anything else is assumed to be GB2312.
    public static string Detect(string file)
    {
        byte[] data = System.IO.File.ReadAllBytes(file);
        // BOM checks: FF FE = UTF-16 LE ("Unicode" in .NET), FE FF = UTF-16 BE, EF BB BF = UTF-8
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) { return "Unicode"; }
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) { return "UTF-16BE"; }
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) { return "UTF-8"; }

        // No BOM: walk the bytes as UTF-8. charByteCounter is 1 between characters;
        // after a lead byte it holds the sequence length and counts back down to 1.
        int charByteCounter = 1;
        byte curByte;
        for (int i = 0; i < data.Length; i++)
        {
            curByte = data[i];
            if (charByteCounter == 1)
            {
                if (curByte >= 0x80)
                {
                    // Lead byte: count its leading 1 bits to get the sequence length
                    while (((curByte <<= 1) & 0x80) != 0)
                    {
                        charByteCounter++;
                    }
                    // 10xxxxxx cannot start a character, and no sequence is longer than 6 bytes
                    if (charByteCounter == 1 || charByteCounter > 6)
                    {
                        return "GB2312";
                    }
                }
            }
            else
            {
                // Continuation bytes must look like 10xxxxxx
                if ((curByte & 0xC0) != 0x80)
                {
                    return "GB2312";
                }
                charByteCounter--;
            }
        }
        // The data ended in the middle of a multibyte character
        if (charByteCounter > 1)
        {
            return "GB2312";
        }
        return "UTF-8";
    }
}

Davider · (This post was last modified: 07-22-2022, 04:25 AM by Davider.)

@Gintaras
Thanks for your help, it works well

I have a question.
The example above handles only one file. If I have to go through two thousand files, each about 2 MB in size, roughly how much difference in speed is there between using the C# function and using pure QM code?
What do you suggest? Thanks again.
As far as I know, PowerShell is slow :)

***Gintaras*** · 07-22-2022, 04:51 AM

The fastest is pure C# with the new program.
This code with the C# Detect function is fast too.
The slowest would be the Detect function converted to QM.

To measure code speed, use PerfX functions.
Example 1:
Macro Macro3210

Code:
PerfFirst
0.01; ;;code example 1
PerfNext
0.02; ;;code example 2
PerfNext
PerfOut

Example 2:
Macro Macro3198

Code:
PerfFirst
rep 3
,int codePage = CsFunc("" path)
,PerfNext
PerfOut
out codePage

To make this code faster when calling the C# function many times, replace the CsFunc line with:

Code:
CsScript c.AddCode("") ;;once
int codePage = c.Call("DetectCP" path) ;;for each file

Davider · 07-22-2022, 12:51 PM

Thanks a lot
