I have an HTML page that contains some filenames that I want to download from a web server. I need to read these filenames in order to build a list that will be passed to my web application, which downloads the files from the server. The filenames have an extension.

I have searched on this topic but haven't found anything except:

Regex cannot be used to parse HTML.

Use HTML Agility Pack

Is there no other way to search an HTML file for text that matches a pattern like filename.ext?

Sample HTML that contains a filename:

 <p class=3DMsoNormal style=3D'margin-top:0in;margin-right:0in;margin-bottom=:0in; margin-left:1.5in;margin-bottom:.0001pt;text-indent:-.25in;line-height:normal;mso-list:l1 level3 lfo8;tab-stops:list 1.5in'><![if !supportLists]> <span style=3D'font-family:"Times New Roman","serif";mso-fareast-font-family:"Times New Roman"'><span style=3D'mso-list:Ignore'>1.<span style=3D'font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

    </span></span></span><![endif]><span style=3D'font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman"'>**13572_PostAccountingReport_2009-06-03.acc**<o:p></o:p></span></p>

I can't use HTML Agility Pack because I'm not allowed to download or use any third-party application or tool.

Can't this be achieved with some other logic?

This is what I have done so far:

    string pageSource = "";
    string geturl = @"C:\Documents and Settings\NASD_Download.mht";
    WebRequest getRequest = WebRequest.Create(geturl);
    WebResponse getResponse = getRequest.GetResponse();
    using (StreamReader sr = new StreamReader(getResponse.GetResponseStream()))
    {
        pageSource = sr.ReadToEnd();
        pageSource.Replace("=", "");
    }

    var fileNames = from Match m in Regex.Matches(pageSource, @"[0-9]+_+[A-Za-z]+_+[0-9]+-+[0-9]+-+[0-9]+.+[a-z]")
                    select m.Value;

    foreach (var s in fileNames)
        Response.Write(s);

Because of some "=" characters occurring in every filename, I am not able to get the filename. How can I remove the occurrences of "=" in the pageSource string?

Thanks in advance

Akhil

1 Answer

Well, keeping in mind that regexes aren't ideal for finding values in HTML:

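// Collect the first filename-like match from every <p> element.
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++) {
    // "ext" stands in for the real extension, e.g. "acc"
    var match = p[i].innerHTML.match(/\s(\S+\.ext)\s/);
    if (match)
        files.push(match[1]);
}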

Note: Read the comments to the question.

If the extension can be anything, you can use this:

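var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++) {
    // any run of non-whitespace characters that contains a dot
    var match = p[i].innerHTML.match(/\b(\S+\.\S+)\b/);
    console.log(match); // debugging aid; logs null when nothing matches
    if (match)
        files.push(match[1]);
}
// Assumes the page has an element with id "result" to show the list in.
document.getElementById('result').innerHTML = files + "";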

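But this is really, really not reliable: the pattern accepts any run of non-whitespace characters that contains a dot, so markup such as a "7.0pt" font size inside a style attribute can match as well.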
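One way to cut down on false positives is to tighten the pattern to the shape of the sample filename in the question: digits, an underscore, a name, an underscore, a date, and an extension. The sketch below is only an illustration under that assumption. The function name findReportNames and the exact pattern are made up here and are based on the single sample filename, so real data may need a looser or different pattern. It works on a plain string, so it does not need the DOM.

```javascript
// Hypothetical helper: the name and the pattern are assumptions modelled on
// the one sample filename shown in the question.
function findReportNames(text) {
    // digits, "_", letters, "_", a YYYY-MM-DD date, ".", a letters-only extension
    var pattern = /\b\d+_[A-Za-z]+_\d{4}-\d{2}-\d{2}\.[A-Za-z]+\b/g;
    // match() with the g flag returns all matches, or null when there are none
    return text.match(pattern) || [];
}

var sample = "<span>13572_PostAccountingReport_2009-06-03.acc<o:p></o:p></span>";
// The array holds one entry: "13572_PostAccountingReport_2009-06-03.acc"
console.log(findReportNames(sample));
```

Because the pattern insists on the underscores and the date, CSS values such as "1.5in" or "7.0pt" no longer match.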

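On the "=" characters mentioned in the question: the sample markup contains sequences such as =3D, which is how quoted-printable encoding writes a literal "=", and .mht files commonly store their HTML part that way. In that encoding a "=" at the end of a line is a soft line break, which is probably what splits the filenames. Simply deleting every "=" would leave stray "3D" text behind. In the C# code, string.Replace also returns a new string rather than changing pageSource, so its result has to be assigned back. The sketch below shows the decoding idea in JavaScript, assuming the text really is quoted-printable. The helper name decodeQuotedPrintable and the raw input string are invented for illustration, and the sketch handles single-byte characters only (multi-byte UTF-8 sequences would need a proper byte decoder).

```javascript
// Hypothetical helper (not from the original answer): decodes quoted-printable
// text, treating each =XX escape as one single-byte character.
function decodeQuotedPrintable(text) {
    return text
        // "=" directly before a line break is a soft line break: drop both
        .replace(/=\r?\n/g, "")
        // "=XX" (two hex digits) is the byte with that value, e.g. "=3D" is "="
        .replace(/=([0-9A-Fa-f]{2})/g, function (whole, hex) {
            return String.fromCharCode(parseInt(hex, 16));
        });
}

// Made-up input modelled on the question's sample, with a soft line break
// placed inside the filename.
var raw = "<p class=3DMsoNormal>13572_PostAccounting=\r\nReport_2009-06-03.acc</p>";
// Logs: <p class=MsoNormal>13572_PostAccountingReport_2009-06-03.acc</p>
console.log(decodeQuotedPrintable(raw));
```

Running the decoding step before any filename pattern means the pattern sees the filename in one piece.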
...