
Bett · 09-04-2014, 06:37 PM

Has anyone built a script that enumerates local folders and HTML files, then generates an HTML page with a link to each file, using each file's HTML title as the link text?

For example, using Enumerate Files, I get excellent raw output:
C:\pub\1\index.html (Title: Miscellaneous Questions)*
C:\pub\2\index.html (Title: Key Questions)*
C:\pub\Overuse.html (Title: Over-using the System)*
* Enumerate Files does not show titles, just file names.

I would like the output to look like:
\1 Miscellaneous Questions
\2 Key Questions
\ Over-using the System

***Gintaras*** · 09-04-2014, 07:24 PM

Function GetTitleFromHTML

;/
function! $html str&title [flags] ;;flags: 1 html is file, 2 fast but unreliable

;Extracts title from <title> tag in HTML.
;Returns 1 if title found, 0 if not.

;html - HTML text. If flag 1 - full path of a local HTML file.
;title - variable that receives title text.
;flags:
;;;1 - html is HTML file path.
;;;2 - to extract title, use regular expression. Almost 1000 times faster, but unreliable, eg can extract <title> from comments or scripts. Without this flag uses HtmlDoc class to parse the HTML.

;EXAMPLE
;str title
;if(GetTitleFromHTML("c:\test\test.htm" title 1)) out title; else out "<NO TITLE>"

opt noerrorshere 1
if(flags&1) html=_s.getfile(html)
title.all
if flags&2
,if(findrx(html "(?si)<title.*?>(.+?)</title>" 0 0 title 1)<0) ret
,title.trim; title.replacerx("\s+" " ")
,ret 1
HtmlDoc d.InitFromText(html)
title=d.d.title
ret title.len!0
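
Editor's sketch (not from the thread): one way the function above might be combined with file enumeration to build the link page asked for in the first post. It assumes that the Dir class, FE_Dir with flag 4 (include subfolders), and d.FileName(1) (full path) behave as in the code generated by the Enumerate Files dialog, and that str.formata and str.setfile append formatted text and write a file. The root folder C:\pub comes from the question; the output path C:\pub_links.html is hypothetical and is kept outside C:\pub so that a rerun does not list the generated page itself.

str page title
Dir d
foreach(d "C:\pub\*.html" FE_Dir 4) ;;4: include subfolders (assumed flag value)
,str path=d.FileName(1) ;;1: full path (assumed)
,if(!GetTitleFromHTML(path title 1)) title=path ;;fall back to the path if there is no title
,page.formata("<a href=''%s''>%s</a><br>[]" path title) ;;'' is a double quote, [] is a new line
page.setfile("C:\pub_links.html") ;;hypothetical output file

Titles are written as is, so a title containing & or < would need HTML-escaping before being placed in the page.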

Bett · 09-06-2014, 02:49 AM

Thanks Gintaras,
Sorry for the dim request. I thought someone might have something sitting on the shelf. My code is much uglier than yours, but I managed to extract almost everything.
Now I'm down to the tough part (for me). I have not yet figured out how to extract a string like "Home-Test System-Notes-Oct 1999" from a set like this:

<div class="nv">
<a class="nv" href="../index.html">Home</a>
-
<a class="nv" href="../10/index.html">Test System</a>
-
<a class="nv" href="../783/index.html">Notes</a>
- Oct 1999
</div>

The number of lines varies, but they are always inside <div class="nv"></div>, and the first character of each separator line is always "-".

Is there a clean way to do this with a regex?

***Gintaras*** · 09-06-2014, 04:49 AM

With a regex it is difficult, unless the HTML is quite simple.
Use HtmlDoc.GetText.
Macro Macro2353

str html
;...
HtmlDoc d.InitFromText(html)
str text=d.GetText
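
Editor's sketch (not from the thread): GetText as used above returns the text of the whole document, so to get only the breadcrumb it may help to cut out the <div class="nv"> block first and parse just that. This reuses only calls that already appear in this thread (findrx with a submatch, InitFromText, GetText, replacerx, trim). It assumes the div contains no nested <div>, that html already holds the page source, and that a hyphen with optional surrounding whitespace is always a separator, so hyphens inside link text would also be tightened. The variable name nav is hypothetical.

str html ;;assumed to already contain the page source
str nav text
;get the inner HTML of the first <div class="nv">; '' is a double quote inside a QM string
if(findrx(html "(?si)<div class=''nv''>(.*?)</div>" 0 0 nav 1)<0) ret
HtmlDoc d.InitFromText(nav)
text=d.GetText
text.replacerx("\s*-\s*" "-") ;;remove whitespace and line breaks around the separators
text.replacerx("\s+" " ") ;;collapse any remaining whitespace runs
text.trim
out text

For the sample in the question, the aim is a single line in the form Home-Test System-Notes-Oct 1999; the exact whitespace that GetText returns has not been verified here.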
