Enumerate local Folders, Files and titles to HTML page
#1
Has anyone built a script that enumerates local folders and HTML files, then generates an HTML page with a link to each file, using each file's HTML title as the link text?

For example, using Enumerate Files, I get excellent raw output:
C:\pub\1\index.html (Title: Miscellaneous Questions)*
C:\pub\2\index.html (Title: Key Questions)*
C:\pub\Overuse.html (Title: Over-using the System)*
* Enumerate Files does not show titles, only file names.

I would like the output to look like:
\1 Miscellaneous Questions
\2 Key Questions
\ Over-using the System
#2
Function GetTitleFromHTML
Code:
;/
function! $html str&title [flags] ;;flags: 1 html is file, 2 fast but unreliable

;Extracts title from <title> tag in HTML.
;Returns 1 if title found, 0 if not.

;html - HTML text. If flag 1 - full path of a local HTML file.
;title - variable that receives title text.
;flags:
;;;1 - html is HTML file path.
;;;2 - use a regular expression to extract the title. Almost 1000 times faster, but unreliable, e.g. it can extract <title> from comments or scripts. Without this flag, the HtmlDoc class is used to parse the HTML.

;EXAMPLE
;str title
;if(GetTitleFromHTML("c:\test\test.htm" title 1)) out title; else out "<NO TITLE>"


opt noerrorshere 1
if(flags&1) html=_s.getfile(html)

title.all
if flags&2
,if(findrx(html "(?si)<title.*?>(.+?)</title>" 0 0 title 1)<0) ret
,title.trim; title.replacerx("\s+" " ")
,ret 1

HtmlDoc d.InitFromText(html)
title=d.d.title
ret title.len!0
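
For readers outside Quick Macros, here is a rough Python sketch of the whole task from post #1: walk a folder, read each HTML file's title with a real parser (the same idea as the HtmlDoc branch above, not the regex shortcut), and emit a list of links. It is not the QM function above; the names `TitleParser`, `get_title` and `build_index` are made up for illustration, and UTF-8 file encoding is an assumption.

```python
import html
import os
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def get_title(html_text):
    """Return the title with whitespace collapsed, or "" if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def build_index(root):
    """Return an HTML <ul> linking every .htm/.html file under root."""
    items = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(folder, name)
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as f:
                title = get_title(f.read()) or "<NO TITLE>"
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            items.append((rel, title))
    items.sort()
    lines = [
        '<li><a href="%s">%s</a></li>'
        % (html.escape(rel, quote=True), html.escape(title))
        for rel, title in items
    ]
    return "<ul>\n" + "\n".join(lines) + "\n</ul>"
```

Calling `build_index(r"C:\pub")` returns the list markup as a string; writing it into a page is left to the caller. Links are relative to the root folder, so the generated page is meant to be saved there.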
#3
Thanks Gintaras,
Sorry for the dim request. I thought someone might have something sitting on the shelf. My code is much uglier than yours, but I managed to extract almost everything.
Now I'm down to the tough part (for me). I have not yet figured out how to extract a string like "Home-Test System-Notes-Oct 1999" from a block like this:

<div class="nv">
<a class="nv" href="../index.html">Home</a>
-
<a class="nv" href="../10/index.html">Test System</a>
-
<a class="nv" href="../783/index.html">Notes</a>
- Oct 1999
</div>

The number of lines varies, but each block sits inside <div class="nv"></div>, and the first character of every separator line is always "-".

Is there a clean way to do this with a regex?
#4
With a regex this is difficult, unless the HTML is quite simple.
Use HtmlDoc.GetText.
Macro Macro2353
Code:
str html
;...
HtmlDoc d.InitFromText(html)
str text=d.GetText
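
The same parse-then-take-the-text idea, sketched in Python for the breadcrumb block from post #3. This is not HtmlDoc.GetText and does not claim to match its output; `NavTextParser` and `nav_trail` are hypothetical names, and the sketch assumes the separators appear as " - " once whitespace is collapsed. If a page has several <div class="nv"> blocks, their text is simply concatenated.

```python
from html.parser import HTMLParser


class NavTextParser(HTMLParser):
    """Collects text found inside <div class="nv"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # > 0 while inside a matching div (counts nested divs)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1
        elif "nv" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def nav_trail(html_text):
    """Return the breadcrumb text as one line, e.g. "Home-Test System-Notes"."""
    parser = NavTextParser()
    parser.feed(html_text)
    parser.close()
    # Collapse all runs of whitespace, then tighten the " - " separators.
    text = " ".join("".join(parser.parts).split())
    return text.replace(" - ", "-")
```

Hyphens inside a link text (no surrounding spaces) are left alone, since only " - " with spaces on both sides is treated as a separator.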

