Posts: 795
Threads: 136
Joined: Feb 2009
Hello,
I'm now fighting with HTML text extraction.
I've got several different types of data to extract. I tried hard with the HtmlDoc class and the provided examples, but it's not enough.
NB: the red line is the wanted text.
1)
<div class='text'>
<b>Covers</b><br/>
http://xxxyyyyzzzz.com/somefile.html <- i want that
</div>
2)
<a href="
http://xxxyyyyzzzz.com/somefile.html" target="_blank">
http://xxxyyyyzzzz.com/somefile.html</a></div>
3)
<div class="image">
<a href="
http://xxxyyyyzzzz.com/somefile" target="_blank"><img src="
http://xxxyyyyzzzz.com/somefile.jpeg"
4)
<a href="
http://xxxyyyyzzzz.com/somefile" target="_top">Download</a><br>
5)
dd.d3.getElementById("lgpd").outerHTML : why d3, and what is it?
6) Where can I find a reference for containerTag & containerNameOrIndex?
Long post, but I've been searching for a long time :/
kind regards,
Laurent.
Posts: 12,073
Threads: 140
Joined: Dec 2002
HtmlDoc can get the HTML or text of a specified tag. To get what is inside, use string functions, e.g. findrx. HTML element functions can also be used.
Macro
Macro1090
str s=
;<body>
;<div class='text'>
;<b>Covers</b><br/>
;http://xxxyyyyzzzz.com/somefile.html <- i want that
;</div>
;</body>
HtmlDoc d.InitFromText(s)
;str s2=d.GetHtml("div" 0)
str s2=d.GetText("div" 0)
;out s2
str s3
if(findrx(s2 "\bhttp:\S+" 0 1 s3)<0) ret
out s3
Often you can easily extract the required strings from the whole page HTML using findrx. Use HtmlDoc only when that is too difficult. HtmlDoc uses the IE HTML parsing engine to parse the page HTML into smaller elements. Then you find the required elements and work with their text or HTML using string functions.
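The "regex over the whole page" idea is not specific to QM. As a rough illustration only, here is the same pattern applied with Python's `re` module instead of findrx; the names `page_html` and `first_http_url` are made up for this sketch, and the sample markup is the one from the macro above.

```python
import re

# Sample markup copied from the macro above.
page_html = """<body>
<div class='text'>
<b>Covers</b><br/>
http://xxxyyyyzzzz.com/somefile.html <- i want that
</div>
</body>"""


def first_http_url(text):
    """Return the first run of non-space characters that starts with 'http:', or None."""
    match = re.search(r"\bhttp:\S+", text)
    return match.group(0) if match else None


url = first_http_url(page_html)
print(url)
```

Note that this pattern stops only at whitespace, so for a URL inside an attribute such as href="..." it would also capture the closing quote; the pattern would need adjusting for that case.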
containerTag is an HTML tag name, like div. To find the first div, use d.GetText("div" 0); to find the next div, use d.GetText("div" 1), and so on.
HtmlDoc.d and d3 are variables of type IHTMLDocument2 and IHTMLDocument3. Both can be used to access the MSHTML DOM. They are documented in the MSDN library.
Posts: 795
Threads: 136
Joined: Feb 2009
The problem with example code 1 is that I can't know, before running the macro, the index of the item to use.
HtmlDoc d.InitFromText(s)
;str s2=d.GetHtml("div" 0)
str s2=d.GetText("div" 0)
This can be ("div" 25 or 3458 or 1250)
;out s2
str s3
if(findrx(s2 "\bhttp:\S+" 0 1 s3)<0) ret
out s3
Is there a way to find the index of the "div" tag I need?
In that example, I *DO* search for text in a <div class='text'> tag.
How to find it?
Quote:HtmlDoc.d and d3 are variables of type IHTMLDocument2 and IHTMLDocument3. Both can be used to access MSHTML DOM. Documented in MSDN library.
Sorry, but it's cryptic to me; I did not even get the difference between IHTMLDocument2 and IHTMLDocument3. It's far too much for my skills.
Posts: 12,073
Threads: 140
Joined: Dec 2002
Macro
Macro1090
str s=
;<body>
;<div>a</div>
;<div>b</div>
;<div class='text'>
;<b>Covers</b><br/>
;http://xxxyyyyzzzz.com/somefile.html <- i want that
;</div>
;<div>x</div>
;<div>y</div>
;</body>
HtmlDoc d.InitFromText(s)
ARRAY(MSHTML.IHTMLElement) a
d.GetHtmlElements(a "div")
int i
for i 0 a.len
,out "----------"
,str s2=a[i].innerText
,out s2
,
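The loop above prints every div. When the wanted element is known by its class attribute rather than by its position, the elements can be filtered by class instead of indexed. The following is only a sketch of that idea in Python, using the standard `html.parser` module rather than HtmlDoc; the class name `DivTextByClass` and the variable names are invented for this example, and it assumes the class attribute holds exactly one class name.

```python
from html.parser import HTMLParser

# Sample markup copied from the macro above.
page_html = """<body>
<div>a</div>
<div>b</div>
<div class='text'>
<b>Covers</b><br/>
http://xxxyyyyzzzz.com/somefile.html <- i want that
</div>
<div>x</div>
<div>y</div>
</body>"""


class DivTextByClass(HTMLParser):
    """Collect the text of every <div> whose class attribute equals wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a matching div
        self.chunks = []    # text pieces of the div currently being read
        self.results = []   # one whitespace-normalized string per matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a div nested inside a matching div
        elif dict(attrs).get("class") == self.wanted_class:
            self.depth = 1   # entering a matching div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Leaving the matching div: join its text and collapse whitespace.
                self.results.append(" ".join("".join(self.chunks).split()))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = DivTextByClass("text")
parser.feed(page_html)
parser.close()
texts = parser.results
print(texts)
```

The collected text can then be passed to a regular expression to pull out the URL, in the same way the earlier macro passes the div text to findrx.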
Posts: 795
Threads: 136
Joined: Feb 2009
Yes, I tried that.
But it's too time-consuming, and I know which tags I want (<div class='text'> or <div class="image"> or <a href=).
So I'd like to search for just the specific tags I need.
If that's not possible, I'll go the findrx way, which I'd like to avoid. I thought the HTML classes could make the job easier.
Sorry for that.
Posts: 795
Threads: 136
Joined: Feb 2009
BTW,
how do I test for tags with " in them using findrx? I can't find the trick.
findrx(text "href"=" 0 16 found).
That would help a lot.
Posts: 12,073
Threads: 140
Joined: Dec 2002
Read QM help topic "Constants".
Posts: 795
Threads: 136
Joined: Feb 2009
OK, sometimes a good idea turns out not to be one.
In fact, I switched back to my prior way of doing it, by grep'ing the HTML text via findrx.
I thought a dedicated class could help, but not in my case.
Sometimes the old classic ways are the way to go.
Thanks.