Extract python/other programming code from html (converted from Jupyter notebook)
#1
Dear Gintaras,
Is it possible to use LA to extract just the Python code from an HTML file (converted from a Jupyter notebook)? Thank you very much! Here is the HTML file: Car price


Attached Files
.zip   CIS 512- Car Price Prediction-0719.html.zip (Size: 890.36 KB / Downloads: 131)
#2
birdywen, might I ask if you first converted the Jupyter notebook with e.g. "jupyter nbconvert thenotebook.ipynb --to html"? If you did, you can extract the python code by instead running "jupyter nbconvert thenotebook.ipynb --to python", which will give you a runnable script. If you only have the already-converted file to work with, there may be several ways to do this, with or without LA.
Regards,
burque505
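
If the original .ipynb file is still available, another route that does not need nbconvert is to read the code cells straight from the notebook file, which is plain JSON. Below is a minimal sketch, assuming the usual nbformat 4 layout (a top-level "cells" array whose items have "cell_type" and "source"). The helper name ExtractCode and the inline sample notebook are made up for illustration; a real script would pass the text of the .ipynb file instead, and in LA you would probably use print.it rather than Console.WriteLine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical helper: returns the source of every code cell, separated by blank lines.
static string ExtractCode(string notebookJson) {
    using var doc = JsonDocument.Parse(notebookJson);
    var cells = new List<string>();
    foreach (var cell in doc.RootElement.GetProperty("cells").EnumerateArray()) {
        // Skip markdown and raw cells.
        if (cell.GetProperty("cell_type").GetString() != "code") continue;
        var source = cell.GetProperty("source");
        // "source" is either a single string or an array of line strings.
        cells.Add(source.ValueKind == JsonValueKind.Array
            ? string.Concat(source.EnumerateArray().Select(line => line.GetString()))
            : source.GetString());
    }
    return string.Join("\n\n", cells);
}

// Tiny made-up notebook for demonstration only.
var sample = """
{"cells":[
 {"cell_type":"markdown","source":["# Title"]},
 {"cell_type":"code","source":["import pandas as pd\n","df = pd.read_csv('cars.csv')"]}
]}
""";
Console.WriteLine(ExtractCode(sample));
```

This only helps when the notebook itself exists. With just the converted HTML, the code has to be pulled out of the markup.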
#3
Code:
// script "财联社.cs"
/*/
nuget -\HtmlAgilityPack;
/*/

using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        var path = @"C:\Users\birdy\Desktop\CIS 512- Car Price Prediction-0719.html";

        // Load the HTML file that was converted from the notebook.
        var doc = new HtmlDocument();
        doc.Load(path);

        // Select the <pre> elements at this XPath.
        var nodes = doc.DocumentNode.SelectNodes("/html/body/div/div[1]/div[2]/div[2]/div/div/pre");
        if (nodes == null) return; // SelectNodes returns null when nothing matches

        foreach (var t in nodes) {
            print.it(t.InnerText);
        }
    }
}

[Image: UXapDhs.jpg]

How can I replace the HTML entities in the output (see the screenshot) with the actual characters ', ", > and <? Thank you so much!
#5
It's strange that the library does not decode HTML entities automatically, but the HtmlAgilityPack namespace has a class for that, HtmlEntity:

print.it(HtmlEntity.DeEntitize(t.InnerText));
#6
Wow, it worked very well! Thanks.
#7
Hi Gintaras,
Is it possible for LA to read the inner text of UI elements, the way HtmlAgilityPack reads the inner text of HTML nodes? I ask because the web page content I want is sometimes available only after logging in, and HtmlAgilityPack cannot extract text from such a page without being logged in to the website.
Thank you!
#8
I know two ways, but probably more exist. Google: "C# extract Chrome web page element text".
1. Get the HTML with elm.Html, then convert the HTML to text, for example using regular expressions or HtmlAgilityPack.
2. Use Selenium. However, it has problems connecting to an existing web browser window. Look in the Cookbook.
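
The regular-expression variant of the first way could look roughly like this. It is only a sketch: the helper name HtmlToText and the sample string are made up, the input is assumed to be the string returned by elm.Html, and entity decoding is done with .NET's WebUtility.HtmlDecode. In LA you would probably call print.it instead of Console.WriteLine.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: crude HTML-to-text conversion with regular expressions.
static string HtmlToText(string html) {
    // Drop <script> and <style> blocks together with their content.
    var s = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Remove all remaining tags.
    s = Regex.Replace(s, @"<[^>]+>", "");
    // Decode entities such as &lt; and &quot; into the actual characters.
    return WebUtility.HtmlDecode(s).Trim();
}

// Made-up sample; in a real script this would be the result of elm.Html.
Console.WriteLine(HtmlToText("<p>if a &lt; b: <b>print</b>(&quot;x&quot;)</p>"));
// prints: if a < b: print("x")
```

Regex stripping breaks easily on messy markup (for example a > inside an attribute value), so HtmlAgilityPack is the safer choice when the page is not trivial.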
#9
Hi Gintaras,

Your method, elm.Html together with HtmlAgilityPack, is a perfect combination. Now I can easily extract any text, or any other element (I mean any elm), that I want from any web page.
That's so nice!
Thank you so much!
 
Code:
// script ""
/*/ nuget -\HtmlAgilityPack; /*/

// https://zerotomastery.io/cheatsheets/python-cheat-sheet/

using HtmlAgilityPack;

// Find the Chrome window that shows the page.
var w = wnd.find(1, "The Best Python Cheat Sheet | Zero To Mastery - Google Chrome", "Chrome_WidgetWin_1");

// For each matching TEXT element, get its HTML and let HtmlAgilityPack extract the text.
foreach (var e in w.Elm["web:GROUPING", prop: "@id=cheatsheet-content"]["TEXT", prop: "level=2"].FindAll()) {
    var html = e.Html(false);
    var doc1 = new HtmlDocument();
    doc1.LoadHtml(html);
    print.it(doc1.DocumentNode.InnerText);
}

