[Solved] HTML parsing galore
HtmlDoc can get the HTML or text of a specified tag. To get what is inside, use string functions, e.g. findrx. HTML element functions can also be used.

Macro Macro1090
Code:
str s=
;<body>
;<div class='text'>
;<b>Covers</b><br/>
;http://xxxyyyyzzzz.com/somefile.html <- i want that
;</div>
;</body>

HtmlDoc d.InitFromText(s)
;str s2=d.GetHtml("div" 0)
str s2=d.GetText("div" 0)
;out s2
str s3
if(findrx(s2 "\bhttp:\S+" 0 1 s3)<0) ret
out s3

Often you can easily extract the required strings from the whole page HTML using findrx. Use HtmlDoc only when that is too difficult. HtmlDoc uses the IE HTML parsing engine to parse the page HTML into smaller elements. You then find the required elements and work with their text or HTML using string functions.
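
For comparison, here is a minimal sketch of the findrx-only approach, without HtmlDoc. It reuses the sample HTML and the same findrx call as the macro above; the variable name url is only for illustration. It assumes the URL is followed by whitespace or a line break. If a tag follows the URL directly, \S+ would also capture the tag, and the regular expression would need adjusting.

```qm
str s=
;<body>
;<div class='text'>
;<b>Covers</b><br/>
;http://xxxyyyyzzzz.com/somefile.html
;</div>
;</body>

str url
if(findrx(s "\bhttp:\S+" 0 1 url)<0) ret
out url
```

Here the regular expression is applied to the raw HTML string, so no parsing step is involved.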

containerTag is an HTML tag name, like div. To find the first div, use d.GetText("div" 0); to find the next div, use d.GetText("div" 1), and so on.
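
A small sketch of the index argument, assuming a page with two sibling div elements. The sample HTML is made up for illustration; the calls are the same ones used in the macro above.

```qm
str s=
;<body>
;<div>first block</div>
;<div>second block</div>
;</body>

HtmlDoc d.InitFromText(s)
str t0=d.GetText("div" 0)
str t1=d.GetText("div" 1)
out t0
out t1
```

The index counts elements with the given tag name, starting from 0.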

HtmlDoc.d and HtmlDoc.d3 are variables of type IHTMLDocument2 and IHTMLDocument3. Both can be used to access the MSHTML DOM. These interfaces are documented in the MSDN library.

