html Extraction
#2
Alright, apparently I am not speaking the right language to get some ideas flowing here. After further searching the forum, I realized that the proper term for what I am trying to do is HTML extraction. I have put together the code below from some of the other posts I have read, and I am now able to get just the text out of the webpage I am using. However, I need to format that text as a CSV file, and I am at a bit of a loss as to how to do this. The main problem is knowing which piece of text comes from which element, so that I can put it in the correct column of the CSV file.

Code:
int w=wait(3 WV win("List Details - Windows Internet Explorer" "IEFrame"))
Acc a1.Find(w "PANE" "List Details" "" 0x3001 3)
str html
a1.WebPageProp(0 0 html)


HtmlDoc d.InitFromText(html)
ARRAY(MSHTML.IHTMLElement) a
d.GetHtmlElements(a "")
int i
for i 0 a.len
    out "----------"
    str s2=a[i].innerText
    out s2

I have also posted an excerpt of the HTML I am trying to extract from. This excerpt corresponds to one row of the CSV file, and there are 20 more blocks of HTML just like this one on the page that I would like to extract. Any help on capturing these unique pieces of information would be a huge help. Below the excerpt you can also see how I would like to format the CSV file.

Code:
<div class="search-result-container contact-result row-fluid"><div class="span12">
    <div class="item-actions-container">
        <div class="actions-row long-line">
<div class="actions-container inline-block" style="width: 80px;">
    <div class="touch-button-container inline-block pull-right">
        <div title="Pin" class="pin-this"></div>
    </div>
    <div class="touch-button-container inline-block pull-right">
        
    </div>
    <div class="touch-button-container touch-right-divider inline-block pull-right">
        <div title="Quick View" class="quick-view"></div>
        <div class="right-divider"></div>
    </div>
</div><div class="social-row">

    <div class="search-result-google search-result-social  pull-right">
        
        <a href="https://plus.google.com/s/Alex%20Abadi" target="_blank"></a>
    </div>

    <div class="search-result-facebook search-result-social  pull-right">
        
        <a href="https://www.facebook.com/search/more/?q=Alex%20Abadi" target="_blank"></a>
    </div>

    <div class="search-result-twitter search-result-social  pull-right">
        
        <a href="https://twitter.com/search?q=Alex%20Abadi&amp;mode=users" target="_blank"></a>
    </div>

    <div class="search-result-linkedin search-result-social  pull-right">
        
        <a href="http://www.linkedin.com/vsearch/f?keywords=Alex+Abadi" target="_blank"></a>
    </div>

    <!--<div class="search-result-companyURL inline-block">-->
    <div class="search-result-url search-result-social  pull-right">
        <a href="http://www.imagemicrosystems.com" target="_blank"></a>
    </div>

</div><div class="connection-meter list-only pull-left">
  <!--<div class="left-side"></div>-->
  <!--<div class="middle"></div>-->
  <!--<div class="right-side"></div>-->
</div>
        </div>
    </div>
    <div class="logo-container">
        <div class="selected-status  pull-left"></div>
        <input class="pull-left" type="checkbox" name="searchResults-10611e14-c5b5-3cac-9679-7b69997eb75d" id="10611e14-c5b5-3cac-9679-7b69997eb75d" data-primitive-type="contact">
        <div class="image-wrapper">
            <!--<div class="p-meter-wrapper"><i class="icon p-meter list-only" ></i></div>-->
            <div class="search-result-icon contact-icon"></div>
            <div class="favicon-container">
            </div>
        </div>
        <i class="icon ideal-prospect-img list-only"></i>

        <div class="ideal-prospect-val list-only">
            0
        </div>
    </div>
    <div class="detail-container">
        <div class="name-row">
            <a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex  Abadi</a>
        </div>
        <div class="search-result-subheadline">
            <span class="large-black-text">Chief Executive Officer at </span>
            <span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
        </div>
        <div class="compact-section">
            <div class="location">Austin,
                Texas,
                United States
                <div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
            </div>

            <div class="compact-section">
                  <div class="small-data-label">Main:</div>
                  <div class="inline-block black-text"><span id="gc-number-24" class="gc-cs-link" title="Call with Google Voice">512-623-5621</span></div>
                  <div>
                      <div class="small-data-label">Direct:</div>
                      <div class="inline-block black-text"><span id="gc-number-25" class="gc-cs-link" title="Call with Google Voice">512-623-5642</span></div>
                  </div>
                <div>
                    <div class="small-data-label">Email:</div>
                    <a class="black-text" href="mailto:alex_abadi@imagemicrosystems.com">alex_abadi@imagemicrosystems.com</a>
                </div>
            </div>

            <div class="">
            </div>
        </div>
    </div>
<div class="right-wrapper">
    <div class="stick-bottom pull-right">
        <div class="notification-container list-only">
            <a class="trigger-wrapper pull-right hidden" href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d?report=company_triggers">
                <span class="trigger-count pull-right"></span>
                <div class="trigger-icon-color pull-right"></div>
            </a>
  <div class="notes-wrapper dropdown text-right">
      <a class="notes dropdown-toggle" data-toggle="dropdown" role="button" data-target="dropdown" data-item-id="10611e14-c5b5-3cac-9679-7b69997eb75d">
      Notes <span class="note-count"></span></a>
    <div class="dropdown-menu text-left">
      <form class="noteEditForm">
        <div class="helptext noteActionLabel">Add a New Note:</div>
        <input type="text" name="label" class="noteLabel" placeholder="Title">
        <textarea name="messageBody" class="noteBody" placeholder="Body"></textarea>
        <input type="hidden" name="entityId" class="entityId" value="10611e14-c5b5-3cac-9679-7b69997eb75d">
        <input type="hidden" name="entityType" class="entityType" value="contact">
        <input type="hidden" name="id" class="noteId">
        <div class="button-wrapper pull-right">
          <a class="cancelNoteButton cancel-link" data-dismiss="dropdown" aria-hidden="true">Cancel</a>
          <input type="submit" class="saveNoteButton btn btn-blue-small" value="Save">
        </div>
        <div class="clearfix"></div>
      </form>

      <div class="existing-notes hide">
        <div class="helptext">Open an Existing Note:</div>
        <ul>
        </ul>
      </div>

    </div>
  </div>
        </div>
        <div class="crm-status" data-id="10611e14-c5b5-3cac-9679-7b69997eb75d">
        </div>
        <div class="list-add-date text-right pull-right list-only">Added 6-Jan-2016</div>
    </div>
</div></div></div>

CSV file example. The file needs to be tab-separated, because the data coming out of the HTML contains commas that I don't want treated as separators.

Code:
Name    Title    Company    City/State    Industry    Main Phone    Direct Phone    Email    Added
Alex Abadi    Chief Executive Officer at    Image Microsystems, Inc.    Austin, Texas, United States    Computer and Peripheral Equipment Manufacturing    512-623-5621    512-623-5642    alex_abadi@imagemicrosystems.com    Added 6-Jan-2016
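To make the mapping concrete, here is a rough sketch of the logic I am after. It is written in Python rather than Quick Macros, purely as an illustration, and it is not a tested solution for the real page. The class names come from the excerpt above; the names `ContactRowParser`, `html_to_tsv_line`, `CLASS_TO_COLUMN` and the shortened `SAMPLE` block are made up for this example. It assumes each text node belongs to the nearest enclosing element with a known class, that the "Main:" / "Direct:" labels always come before their phone numbers, and that the email can be read from the `mailto:` link.

```python
from html.parser import HTMLParser

COLUMNS = ["Name", "Title", "Company", "City/State", "Industry",
           "Main Phone", "Direct Phone", "Email", "Added"]

# CSS class -> CSV column (class names taken from the excerpt above).
CLASS_TO_COLUMN = {
    "name-row": "Name",
    "large-black-text": "Title",
    "contact-company-name": "Company",
    "location": "City/State",
    "contact-industry": "Industry",
    "small-data-label": "_label",  # "Main:" / "Direct:" marker, not a column
    "gc-cs-link": "_phone",        # phone number; column depends on last label
    "list-add-date": "Added",
}
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}  # never get an end tag


class ContactRowParser(HTMLParser):
    """Collects the text of one result block into per-column buckets."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open element: column key or None
        self.fields = {c: [] for c in COLUMNS}
        self.label = ""  # last phone label seen ("Main" or "Direct")

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("mailto:"):
            self.fields["Email"].append(href[len("mailto:"):])
        if tag in VOID_TAGS:
            return
        classes = (attrs.get("class") or "").split()
        key = next((CLASS_TO_COLUMN[c] for c in classes
                    if c in CLASS_TO_COLUMN), None)
        self.stack.append(key)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Nearest enclosing element that has a known class, if any.
        key = next((k for k in reversed(self.stack) if k), None)
        if key == "_label":
            if data.strip():
                self.label = data.strip().rstrip(":")
        elif key == "_phone":
            if self.label in ("Main", "Direct"):
                self.fields[self.label + " Phone"].append(data)
        elif key:
            self.fields[key].append(data)

    def row(self):
        # Join the pieces of each column and collapse runs of whitespace.
        return [" ".join("".join(self.fields[c]).split()) for c in COLUMNS]


def html_to_tsv_line(block_html):
    """Turn the HTML of one result block into one tab-separated line."""
    parser = ContactRowParser()
    parser.feed(block_html)
    parser.close()
    return "\t".join(parser.row())


# Shortened version of the excerpt above, keeping only the data-bearing tags.
SAMPLE = """
<div class="detail-container">
  <div class="name-row"><a href="/contact/10611e14-c5b5-3cac-9679-7b69997eb75d">Alex  Abadi</a></div>
  <div class="search-result-subheadline">
    <span class="large-black-text">Chief Executive Officer at </span>
    <span class="contact-company-name"><a href="/company/d0a95324-611b-36b7-8a5b-b753ab957e36" class="clickable">Image Microsystems, Inc.</a></span>
  </div>
  <div class="location">Austin,
      Texas,
      United States
      <div class="contact-industry">Computer and Peripheral Equipment Manufacturing</div>
  </div>
  <div class="small-data-label">Main:</div>
  <div class="inline-block black-text"><span class="gc-cs-link">512-623-5621</span></div>
  <div class="small-data-label">Direct:</div>
  <div class="inline-block black-text"><span class="gc-cs-link">512-623-5642</span></div>
  <a class="black-text" href="mailto:alex_abadi@imagemicrosystems.com">alex_abadi@imagemicrosystems.com</a>
</div>
<div class="list-add-date text-right pull-right list-only">Added 6-Jan-2016</div>
"""

if __name__ == "__main__":
    print("\t".join(COLUMNS))
    print(html_to_tsv_line(SAMPLE))
```

In this sketch each of the 20 `search-result-container` blocks would be passed to `html_to_tsv_line` separately, giving one line per contact. Splitting the full page into those blocks is not shown here.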

Any help you can provide in identifying the particular HTML elements to pull out would be great. Currently I am only able to pull all of the text into a text file, which isn't helpful for the project I am working on.

Really appreciate any help you give me.

Best Regards,

Paul

