Posts: 27
Threads: 9
Joined: Apr 2013
Hello QMers,
I have been searching the forum for a topic that covers setting up a web scrape with Quick Macros. The webpage I need is behind a login page, so none of the typical scrapers out there work. In other words, I need to build a scraper that works through the browser GUI.
The data I want to scrape is laid out as multiple rows in the same format. I would like to write one scraper for a complete row, put it in a loop to grab the rest of the rows on the page, and then move on to the next page.
I have already used the html element actions to create the scraper, but the elements change from one page to the next. I am not sure how to attack this one and would greatly appreciate some help.
Let me know what other information I can provide to help you understand what I am trying to do.
Thanks,
Paul
Alright, apparently I am not speaking the right language to get some ideas flowing here. After further searching the forum, I realize that the proper term for what I am trying to do is HTML extraction. I have put together this code (below) from some of the other posts I have read, and I am now able to get just the text out of the webpage I am using. However, I need to format the extracted text as CSV, and I am at a loss as to how to do this. The main problem is knowing which piece of text is which, so that I can put it in the correct column of the CSV file.
int w=wait(3 WV win("List Details - Windows Internet Explorer" "IEFrame"))
Acc a1.Find(w "PANE" "List Details" "" 0x3001 3)
str html
a1.WebPageProp(0 0 html)
HtmlDoc d.InitFromText(html)
ARRAY(MSHTML.IHTMLElement) a
d.GetHtmlElements(a "")
int i
for i 0 a.len
	out "----------"
	str s2=a[i].innerText
	out s2
I have also posted an excerpt of the HTML I am trying to extract from. This excerpt corresponds to one row of the CSV file, and there are 20 more blocks of HTML just like it on the page that I would like to extract. Any help with capturing these individual pieces of information would be huge. Below the HTML you can also see how I would like to format the CSV file.
<div class="search-result-container contact-result row-fluid"><div class="span12">
<div class="item-actions-container">
<div class="actions-row long-line">
<div class="actions-container inline-block" style="width: 80px;">
<div class="touch-button-container inline-block pull-right">
<div title="Pin" class="pin-this"></div>
</div>
<div class="touch-button-container inline-block pull-right">
</div>
<div class="touch-button-container touch-right-divider inline-block pull-right">
<div title="Quick View" class="quick-view"></div>
<div class="right-divider"></div>
</div>
</div><div class="social-row">
<div class="search-result-google search-result-social pull-right">
<a href="https://plus.google.com/s/Alex%20Abadi" target="_blank"></a>
</div>
<div class="search-result-facebook search-result-social pull-right">
<a href="https://www.facebook.com/search/more/?q=Alex%20Abadi" target="_blank"></a>
</div>
<div class="search-result-twitter search-result-social pull-right">
<a href="https://twitter.com/search?q=Alex%20Abadi&mode=users" target="_blank"></a>
</div>
<div class="search-result-linkedin search-result-social pull-right">
<a href="http://www.linkedin.com/vsearch/f?keywords=Alex+Abadi" target="_blank"></a>
</div>
<!--<div class="search-result-companyURL inline-block">-->
<div class="search-result-url search-result-social pull-right">
<a href="http://www.imagemicrosystems.com" target="_blank"></a>
</div>
</div><div class="connection-meter list-only pull-left">
<!--<div class="left-side"></div>-->
<!--<div class="middle"></div>-->
<!--<div class="right-side"></div>-->
</div>
</div>
</div>
<div class="logo-container">
<div class="selected-status pull-left"></div>
<input class="pull-left" type="checkbox" name="searchResults-10611e14-c5b5-3cac-9679-7b69997eb75d" id="10611e14-c5b5-3cac-9679-7b69997eb75d" data-primitive-type="contact">
<div class="image-wrapper">
<!--<div class="p-meter-wrapper"><i class="icon p-meter list-only" ></i></div>-->
<div class="search-result-icon contact-icon"></div>
<div class="favicon-container">
</div>
</div>
<i class="icon ideal-prospect-img list-only"></i>
<div class="ideal-prospect-val list-only">
0
</div>
</div>
<div class="detail-container">
<div class="name-row">
<a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex Abadi</a>
</div>
<div class="search-result-subheadline">
<span class="large-black-text">Chief Executive Officer at </span>
<span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
</div>
<div class="compact-section">
<div class="location">Austin,
Texas,
United States
<div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
</div>
<div class="compact-section">
<div class="small-data-label">Main:</div>
<div class="inline-block black-text"><span id="gc-number-24" class="gc-cs-link" title="Call with Google Voice">512-623-5621</span></div>
<div>
<div class="small-data-label">Direct:</div>
<div class="inline-block black-text"><span id="gc-number-25" class="gc-cs-link" title="Call with Google Voice">512-623-5642</span></div>
</div>
<div>
<div class="small-data-label">Email:</div>
<a class="black-text" href="mailto:[email protected]">[email protected]</a>
</div>
</div>
<div class="">
</div>
</div>
</div>
<div class="right-wrapper">
<div class="stick-bottom pull-right">
<div class="notification-container list-only">
<a class="trigger-wrapper pull-right hidden" href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d?report=company_triggers">
<span class="trigger-count pull-right"></span>
<div class="trigger-icon-color pull-right"></div>
</a>
<div class="notes-wrapper dropdown text-right">
<a class="notes dropdown-toggle" data-toggle="dropdown" role="button" data-target="dropdown" data-item-id="10611e14-c5b5-3cac-9679-7b69997eb75d">
Notes <span class="note-count"></span></a>
<div class="dropdown-menu text-left">
<form class="noteEditForm">
<div class="helptext noteActionLabel">Add a New Note:</div>
<input type="text" name="label" class="noteLabel" placeholder="Title">
<textarea name="messageBody" class="noteBody" placeholder="Body"></textarea>
<input type="hidden" name="entityId" class="entityId" value="10611e14-c5b5-3cac-9679-7b69997eb75d">
<input type="hidden" name="entityType" class="entityType" value="contact">
<input type="hidden" name="id" class="noteId">
<div class="button-wrapper pull-right">
<a class="cancelNoteButton cancel-link" data-dismiss="dropdown" aria-hidden="true">Cancel</a>
<input type="submit" class="saveNoteButton btn btn-blue-small" value="Save">
</div>
<div class="clearfix"></div>
</form>
<div class="existing-notes hide">
<div class="helptext">Open an Existing Note:</div>
<ul>
</ul>
</div>
</div>
</div>
</div>
<div class="crm-status" data-id="10611e14-c5b5-3cac-9679-7b69997eb75d">
</div>
<div class="list-add-date text-right pull-right list-only">Added 6-Jan-2016</div>
</div>
</div></div></div>
CSV File Example - This file needs to be separated by tabs, because the data coming out of the HTML contains commas that must not be treated as separators.
Name	Title	Company	City/State	Industry	Main Phone	Direct Phone	Email	Added
Alex Abadi	Chief Executive Officer at	Image Microsystems, Inc.	Austin, Texas, United States	Computer and Peripheral Equipment Manufacturing	512-623-5621	512-623-5642	[email protected]	Added 6-Jan-2016
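To pin down the target format, here is a minimal sketch of the tab-separated output, written in Python rather than Quick Macros purely for illustration. It assumes each contact has already been reduced to a list of nine strings in column order; `rows_to_tsv` and `COLUMNS` are made-up names, not anything from QM.

```python
import csv
import io

# Column order taken from the example header above.
COLUMNS = ["Name", "Title", "Company", "City/State", "Industry",
           "Main Phone", "Direct Phone", "Email", "Added"]

def rows_to_tsv(rows):
    """Return the header plus all rows as tab-separated text.

    Because the delimiter is a tab, commas inside a field
    (e.g. "Image Microsystems, Inc.") are written unchanged.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()

if __name__ == "__main__":
    # Values copied from the example row in this post.
    sample = ["Alex Abadi", "Chief Executive Officer at",
              "Image Microsystems, Inc.", "Austin, Texas, United States",
              "Computer and Peripheral Equipment Manufacturing",
              "512-623-5621", "512-623-5642", "[email protected]",
              "Added 6-Jan-2016"]
    print(rows_to_tsv([sample]), end="")
```

A contact with no direct phone would simply carry an empty string in that position, so every row keeps nine columns.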
Any help in identifying the particular HTML elements to pull out would be great. Currently I can only dump all of the text into a text file, which isn't useful for the project I am working on.
I really appreciate any help you can give me.
Best Regards,
Paul
I have made it further after finding the following code on the forum. However, it still doesn't fire on all cylinders for me, because I'm missing the industry, the city and state, and the date the contact was added, all of which I need from my extraction.
Gintaras, I sure would appreciate it if you could help me figure out the last piece of the puzzle here. I tried using .className to identify the "location" and "industry" classes, but for some reason I can't get a sel/case inside the for loop to capture this data separately. The final piece of the puzzle is getting the data into the columns and rows of a tab-delimited CSV file. Any help you could provide with this would be great too.
str s=
<BODY>
<div class="detail-container">
<div class="name-row">
<a href="/contact/b070f5e9-30d7-3da5-bc39-780c3455b71e">Mitch Acker</a>
</div>
<div class="search-result-subheadline">
<span class="large-black-text">President, Sales Executive at </span>
<span class="contact-company-name"><a href="/company/66819229-e58e-36e8-a282-c11f68eb2453" class="clickable">Martinaire Inc</a></span>
</div>
<div class="compact-section">
<div class="location">Addison,
Texas,
United States
<div class="contact-industry">Airlines</div>
</div>
<div class="compact-section">
<div class="small-data-label">Main:</div>
<div class="inline-block black-text"><span id="gc-number-20" class="gc-cs-link" title="Call with Google Voice">972-349-5700</span></div>
<div>
<div class="small-data-label">Email:</div>
<a class="black-text" href="mailto:[email protected]">[email protected]</a>
</div>
</div>
<div class="">
</div>
</div>
</div>
<div class="detail-container">
<div class="name-row">
<a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex Abadi</a>
</div>
<div class="search-result-subheadline">
<span class="large-black-text">Chief Executive Officer at </span>
<span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
</div>
<div class="compact-section">
<div class="location">Austin,
Texas,
United States
<div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
</div>
<div class="compact-section">
<div class="small-data-label">Main:</div>
<div class="inline-block black-text"><span id="gc-number-24" class="gc-cs-link" title="Call with Google Voice">512-623-5621</span></div>
<div>
<div class="small-data-label">Direct:</div>
<div class="inline-block black-text"><span id="gc-number-25" class="gc-cs-link" title="Call with Google Voice">512-623-5642</span></div>
</div>
<div>
<div class="small-data-label">Email:</div>
<a class="black-text" href="mailto:[email protected]">[email protected]</a>
</div>
</div>
<div class="">
</div>
</div>
</div>
</BODY>
out
s.findreplace("span" "a")
HtmlDoc d.InitFromText(s)
ARRAY(MSHTML.IHTMLElement) h2 div
int i j
d.GetHtmlElements(div "div")
for i 0 div.len
	str cn=div[i].className
	if cn="detail-container"
		d.GetHtmlElements(h2 "a" "" div[i].sourceIndex)
		for j 0 h2.len
			out h2[j].innerText
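To show the logic I'm after (not working QM code), here is a rough Python sketch that assigns each piece of text to a column by the class of its nearest enclosing element. The "Main:", "Direct:" and "Email:" labels decide where the next value goes. `ContactParser`, `FIELD_BY_CLASS`, `LABELS` and `extract_contacts` are names I made up, and the sketch assumes the div/span/a tags are balanced, as they are in the sample above.

```python
from html.parser import HTMLParser

# Class name -> output column (class names taken from the sample HTML).
FIELD_BY_CLASS = {
    "name-row": "Name",
    "large-black-text": "Title",
    "contact-company-name": "Company",
    "location": "City/State",
    "contact-industry": "Industry",
    "list-add-date": "Added",
}
# Text of a small-data-label -> column for the value that follows it.
LABELS = {"Main:": "Main Phone", "Direct:": "Direct Phone", "Email:": "Email"}
TRACKED = {"div", "span", "a"}  # only these tags go on the stack

class ContactParser(HTMLParser):
    def __init__(self, container_class="detail-container"):
        super().__init__()
        self.container_class = container_class
        self.stack = []       # class lists of the currently open tracked tags
        self.records = []     # one dict per finished container
        self.current = None   # record being filled, None outside a container
        self.pending = None   # column waiting for its value after a label

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.current is None and self.container_class in classes:
            self.current = {}
            self.container_depth = len(self.stack)
        self.stack.append(classes)

    def handle_endtag(self, tag):
        if tag not in TRACKED or not self.stack:
            return
        self.stack.pop()
        if self.current is not None and len(self.stack) == self.container_depth:
            self.records.append(self.current)   # container just closed
            self.current, self.pending = None, None

    def _field_for_stack(self):
        # Innermost element first, so contact-industry wins over location.
        for classes in reversed(self.stack):
            for cls in classes:
                if cls in FIELD_BY_CLASS:
                    return FIELD_BY_CLASS[cls]
        return None

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse the line breaks in location
        if not text or self.current is None:
            return
        if "small-data-label" in self.stack[-1]:
            self.pending = LABELS.get(text)
            return
        field = self._field_for_stack()
        if field is None:
            field, self.pending = self.pending, None
        if field:
            old = self.current.get(field, "")
            self.current[field] = (old + " " + text).strip()

def extract_contacts(html, container_class="detail-container"):
    parser = ContactParser(container_class)
    parser.feed(html)
    parser.close()
    return parser.records
```

Two things this makes visible. In the sample, the contact-industry div sits inside the location div, so the innerText of "location" includes the industry unless the innermost class wins. The added date (class list-add-date) lies outside detail-container, so picking it up would presumably mean using the outer search-result-container as the container, which I have not tested against the real page.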
Thanks Again,
Paul