Extracting data from web page (table etc); will HtmlDoc be included in LA?
#1
Hi Dear Gintaras,

I was worried I would be kicked off the forum for asking too many "stupid" questions.

I would like to ask whether the HtmlDoc function will be added to LA, or whether it already exists somewhere in LA. I was not able to translate my code from QM2 to LA (I am new to programming). Could you please help with this?

https://www.opera-arias.com/arias/ 
I am trying to extract the aria information to a CSV file. There are 100+ pages, and I was wondering whether it is possible to write a "for" loop that automatically finds the "Next" page button and clicks it after each cycle. Thank you so much for any help!
#2
Quote:I was worried I would be kicked off the forum for asking too many

Your questions help LA development: they test LA with real tasks.
 
Quote:Will HtmlDoc be included in LA?

Unlikely. Use other libraries, for example HtmlAgilityPack. Look in Cookbook -> Internet -> Parse HTML.
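
If the table is present in the downloaded HTML, the HtmlAgilityPack route could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is installed. The inline HTML and its column names are made up for illustration and are not the real markup of opera-arias.com.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package HtmlAgilityPack (assumed to be installed)

// Made-up sample HTML. A real script would download the page text first.
string html = @"
<table id='arias'>
  <tr><th>Aria</th><th>Opera</th><th>Voice</th></tr>
  <tr><td>Quando m'en vo</td><td>La Boh&egrave;me</td><td>Soprano</td></tr>
  <tr><td>Nessun dorma</td><td>Turandot</td><td></td></tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var rows = new List<string[]>();
// SelectNodes returns null when nothing matches, hence the null check.
// The [td] predicate skips the header row, which contains only <th> cells.
var trs = doc.DocumentNode.SelectNodes("//table[@id='arias']//tr[td]");
if (trs != null) {
    foreach (var tr in trs) {
        // Empty <td> elements still exist in the HTML, so column indices stay correct.
        rows.Add(tr.Elements("td")
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
            .ToArray());
    }
}

foreach (var r in rows) Console.WriteLine(string.Join(" | ", r));
```

Unlike the elm approach, empty cells need no special handling here, because the HTML contains an element for every cell.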

In this case elm can be used instead.

Code:
// script "Opera arias.cs"
//https://www.opera-arias.com/arias/#x

print.clear();
var csv = new csvTable { ColumnCount = 9 };

var w = wnd.find(1, "Opera Arias *- Google Chrome", "Chrome_WidgetWin_1");

for (; ; ) {
    _Page();
    //break;
    var next = w.Elm["web:LINK", "Next"].Find(-1);
    if (next == null) break;
    next.WebInvoke();
}


print.it(csv);

void _Page() {
    var table = w.Elm["web:GROUPING", prop: "@id=table_div"].Find(1);
    
    //Some cells are empty, and there are no elms for empty cells, therefore cell indices become incorrect.
    //Solution: at first get column x offsets from the header row. Then can skip empty cells.

    var header = table.Navigate("pr");
    var ax = header.Elm["LINK"].FindAll().Select(o => o.Rect.CenterX).ToArray();
    
    for (var row = table.Navigate("fi"); row != null; row = row.Navigate("ne")) {
        var cells = new string[csv.ColumnCount];
        var cell = row.Navigate("fi ne");
        for (int i = 0; i < csv.ColumnCount; i++) {
            if (i > 0) { cell = cell.Navigate("ne"); if (cell == null) break; }
            
            //correct column index for empty cells
            for (int x = cell.Rect.left; x > ax[i] && x != 0;) i++;
            
            var s = i switch { 1 => cell.HtmlAttribute("style")[6..^2], 6 => cell.Navigate("fi").Name, _ => cell.Name };
            
            if (i == csv.ColumnCount - 1) { //the last column. Some cells consist of multiple elements.
                while ((cell = cell.Navigate("ne")) != null) s += cell.Name;
            }

            
            cells[i] = s;
        }

        csv.AddRow(cells);
    }
}
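
The column-index correction described in the comments of _Page can be tried in isolation. This is a minimal sketch in plain C#, with made-up pixel coordinates standing in for the elm rectangles. It is not the script's exact loop: it adds a bounds check and uses hypothetical names (centers, rowCells).

```csharp
using System;

// Hypothetical header-column center x coordinates, like the ax array in the script.
int[] centers = { 50, 150, 250, 350 };

// Hypothetical non-empty cells of one row: (left edge x, text).
// The cell of column 1 is empty, so no element exists for it.
(int left, string text)[] rowCells = { (10, "A"), (210, "C"), (310, "D") };

var cells = new string[centers.Length];
int col = 0;
foreach (var (left, text) in rowCells) {
    // A header center to the left of this cell means that column was empty: skip it.
    while (col < centers.Length - 1 && left > centers[col]) col++;
    cells[col++] = text;
}

Console.WriteLine(string.Join(",", cells)); // prints "A,,C,D"
```

The skipped column stays null in the array, and string.Join writes it as an empty field.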
#3
Hi Dear Gintaras,

I have tried the code, but nothing was shown in the output panel. I also tried saving the CSV to the desktop folder and got nothing either. Could you please help me figure it out? By the way, does this code work for similar tables on other web pages (after modification)?

Thank you so much for the code. It is hard for me to understand every step. What is this part for: "var s = i switch { 1 => cell.HtmlAttribute("style")[6..^2]"?
#4
The script may run for more than 30 s, and only then prints the CSV. I tested it and it never failed.

I tested in Chrome 113 + "uBlock Origin" extension.

If it does not work, use print.it(...) to debug the script.

That line gets some more useful text than just cell.Name (I guess). If you don't need it, replace the line with var s = cell.Name;.
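
As for the [6..^2] part: it is the C# range operator applied to a string. It drops the first 6 characters and the last 2. A small standalone illustration, using a made-up style value (the real attribute text on the page may differ):

```csharp
using System;

// Hypothetical attribute value, for illustration only.
string style = "width:85%;";

// Start at index 6 (just after "width:") and stop 2 characters before the end.
string s = style[6..^2];

Console.WriteLine(s); // prints "85"
```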
#5
I am so sorry for the previous reply. It works very smoothly. I waited a few seconds without seeing any output and stopped the script. I didn't notice that the results are displayed only after all the code has finished running (after the whole for loop). I have also tested it with other similar websites to extract table content, and it works smoothly there too. I love love love LA.....!

Gintaras, I couldn't think of any words to express my gratitude!
#6
Hi Gintaras,

I just noticed that some cells look like this: "Musetta/Alcindoro/Mimì/Rodolfo", but the code extracts only the first part, "Musetta", instead of the whole text. Would it be possible to join the parts into one cell? For an example, please see the last cell of row 25 on page 1.

Thank you so much!
#7
Fixed. Please use the updated code.
#8
Works great! No technical support could be faster or more efficient than you! Thanks.
#9
Hi Gintaras,

I am still studying this code. The line I marked can be replaced with var cell = row.Navigate("ch2"). Why did you use "fi ne"? I didn't understand it until I tried using ch2. Could you please explain whether there is any difference between them?

Thank you so much!
[Image: Gf2t32k.png]
#10
No difference. "fi ne" means "get the first child, then its next sibling". The result is the same as ch2, which means "get the second child".
#11
Great! Thank you!
#12
Hi Gintaras, how can I make a loop that turns pages if there is nothing like a "Next" button on the page? Thanks!
https://notes-box.com/musicians/
#13
One way is to loop over a list of link names.

Code:
// script "notes-box.cs"
//https://notes-box.com/musicians/a/

print.clear();

var w = wnd.find(0, "* Google Chrome", "Chrome_WidgetWin_1");
foreach (var v in "0-9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z".Split(',')) {
    var e = w.Elm["web:LINK", v].Find(5);
    print.it(e);
    e.Invoke();
    e = w.Elm["web:GROUPING", "Artist* " + v].Find(5);
    var links = e.Parent.Elm["LINK", prop: "level=0"].FindAll();
    print.it(links.Length);
}
#14
Code:
// script "notes-box.cs"
//https://notes-box.com/musicians/a/

print.clear();

var w = wnd.find(0, "* Google Chrome", "Chrome_WidgetWin_1");
foreach (var v in "0-9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z".Split(',')) {
//foreach (var v in "A".Split(',')) { //test 1 page
    
    //open page
    var e = w.Elm["web:LINK", v].Find(5);
    print.it(e.Name);
    e.Invoke();
    e = w.Elm["web:GROUPING", "Artist* " + v].Find(5);
    
    //show all
    var all = w.Elm["web:LINK", "All"].Find(-1);
    if (all != null) {
        all.Invoke();
        e = w.Elm["web:LINK", "Paged"].Find(10);
    }
    
    //get all artists
    var links = e.Parent.Elm["LINK", prop: "level=0"].FindAll();
    print.it(links.Length);
}
#15
Thank you. This is so creative. I found that there are always more solutions than problems, and there are many ways to do it. It's amazing!
#16
Hi Gintaras, sorry to bother you again. The page buttons on https://filecr.com/ms-windows/?id=685550968000 have no numbers in their names; all of them are named "pagination". I tried the method you used before, but it failed. Is there any solution? I really appreciate your help.
 
Code:
var w = wnd.find(1, "* - Google Chrome", "Chrome_WidgetWin_1");

for (var e = w.Elm["web:BUTTON", "pagination"].Find(); e != null; e = e.Navigate("ne")) {
    e.Invoke();
    2.s();
}
#17
Try Selenium. It is more reliable for web browser automation; for example, it can reliably wait until a web page is loaded. But it is not as easy to use, and you will need to install 2 NuGet packages and update them for each new Chrome version.
 
Code:
// script ""
/*/ nuget selenium\Selenium.Support; nuget selenium\Selenium.WebDriver.ChromeDriver; /*/
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Interactions;
using OpenQA.Selenium.Support.Extensions;
using OpenQA.Selenium.Support.UI;
script.setup(trayIcon: true, sleepExit: true, exitKey: KKey.MediaStop, pauseKey: KKey.MediaPlayPause);
print.clear();

//Starts new Chrome instance.

ChromeOptions options = new();
//Enable and maybe edit this if want to use an existing profile. To get profile path, in Chrome open URL "chrome://version/".
//    Then before starting this script also may need to close existing Chrome instances that use this profile.
//options.AddArguments($"user-data-dir={folders.LocalAppData + @"Google\Chrome\User Data"}", "profile-directory=Profile 1");


var service = ChromeDriverService.CreateDefaultService();
service.HideCommandPromptWindow = true;
using var driver = new ChromeDriver(service, options);
driver.Manage().Window.Maximize();

for (int i = 1; i <= 5; i++) {
    script.pause();
    driver.Navigate().GoToUrl($"https://filecr.com/ms-windows/?page={i}"); //opens and waits until loaded
    1.s();
}


1.s();
dialog.show("Close web browser", x: ^1);



The same approach can be used without Selenium, but I could not find a reliable way to wait until the web page is loaded, so you would need to add delays, which makes the script slower.
#18
This script uses an existing Chrome instance which must be started with a special command line.

Code:
// script ""
/*/ nuget selenium\Selenium.Support; nuget selenium\Selenium.WebDriver.ChromeDriver; /*/
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Interactions;
using OpenQA.Selenium.Support.Extensions;
using OpenQA.Selenium.Support.UI;
script.setup(trayIcon: true, sleepExit: true, exitKey: KKey.MediaStop, pauseKey: KKey.MediaPlayPause);
print.clear();

//Chrome must be launched with command line like this:
//run.it("chrome.exe", "--remote-debugging-port=9222");
////run.it("chrome.exe", $"--remote-debugging-port=9222 --user-data-dir=\"{folders.LocalAppData + @"Google\Chrome\User Data"}\"");


ChromeOptions options = new() { DebuggerAddress = "127.0.0.1:9222" };
var service = ChromeDriverService.CreateDefaultService();
service.HideCommandPromptWindow = true;
using var driver = new ChromeDriver(service, options);
driver.Manage().Window.Maximize();

for (int i = 1; i <= 2; i++) {
    script.pause();
    driver.Navigate().GoToUrl($"https://filecr.com/ms-windows/?page={i}"); //opens and waits until loaded
    1.s();
}
#19
Thank you so much. It's so powerful to combine LA with these packages.

