I am trying to download a webpage and store it in an accessible format in SQL Server 2012. I have resorted to dynamic SQL, but perhaps there is a cleaner, easier way to do this. Using the code below I can successfully download the .htm files to my local drive, but I am having difficulty working with the HTML itself. I am trying to convert the webpage to XML and parse it from there, but I think I am not handling the HTML-to-XML conversion properly.
I get the following error: "Parsing XML with internal subset DTDs not allowed. Use CONVERT with style option 2 to enable limited internal subset DTD support."
    DECLARE @URL NVARCHAR(500);
    DECLARE @Ticker NVARCHAR(10)
    DECLARE @DynamicTickerNumber INT
    SET @DynamicTickerNumber = 1

    CREATE TABLE Parsed_HTML(
        [Date] DATETIME
        ,[Ticker] VarChar (8)
        ,[NodeName] VarChar (50)
        ,[Value] NVARCHAR (50));

    WHILE @DynamicTickerNumber <= 2
    BEGIN
        SET @Ticker = (SELECT [Ticker] FROM [Unique Tickers Yahoo] WHERE [Unique Tickers Yahoo].[Ticker Number]= @DynamicTickerNumber)
        SET @URL ='http://finance.yahoo.com/q/ks?s=' + @Ticker + '+Key+Statistics'

        DECLARE @cmd NVARCHAR(250);
        DECLARE @tOutput TABLE(data NVARCHAR(100));
        DECLARE @file NVARCHAR(MAX);
        SET @file='D:\Ressources\Execution Model\Execution Model for SQL\DB Temp\quoteYahooHTML.htm'
        SET @cmd ='powershell "(new-object System.Net.WebClient).DownloadFile('''+@URL+''','''+@file+''')"'
        EXEC master.dbo.xp_cmdshell @cmd, no_output

        CREATE TABLE XmlImportTest
        (
            xmlFileName VARCHAR(300),
            xml_data xml
        );

        DECLARE @xmlFileName VARCHAR(300)
        SELECT @xmlFileName = 'D:\Ressources\Execution Model\Execution Model for SQL\DB Temp\quoteYahooHTML.htm'

        EXEC('
            INSERT INTO XmlImportTest(xmlFileName, xml_data)
            SELECT ''' + @xmlFileName + ''', xmlData
            FROM
            (
                SELECT *
                FROM OPENROWSET (BULK ''' + @xmlFileName + ''' , SINGLE_BLOB) AS XMLDATA
            ) AS FileImport (XMLDATA)
        ')

        DECLARE @x XML;
        DECLARE @string VARCHAR(MAX);
        SET @x = (SELECT xml_data FROM XmlImportTest)
        SET @string = CONVERT(VARCHAR(MAX), @x, 1);

        INSERT INTO [Parsed_HTML] ([NodeName], [Value])
        SELECT [NodeName], [Value] FROM dbo.XMLTable(@string)
        --above references XMLTable Parsing function that works consistently
    END

Unfortunately this needs to be run within the confines of SQL Server, and my understanding is that the HTML Agility Pack is not immediately compatible. I also notice that the intermediate table, XMLimportTest, never gets populated, so this is likely not a function of malformed HTML.
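The error text itself suggests one thing to try: converting explicitly with CONVERT and style 2 rather than relying on the implicit conversion into the xml column. The following is only a minimal sketch under assumptions, not a confirmed fix. The @html literal is a hypothetical stand-in for the downloaded file, and the td elements in it are made up for illustration. Style 2 only gets the parser past the DOCTYPE; a real page that is not well-formed XML (unclosed tags, undeclared entities such as &nbsp;) would most likely still fail to convert.

```sql
-- Hypothetical stand-in for the contents of the downloaded .htm file.
-- A well-formed document with a small internal DTD subset is assumed here.
DECLARE @html NVARCHAR(MAX) =
    N'<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>' +
    N'<html><body><table><tr>' +
    N'<td>Market Cap</td><td>1.5B</td>' +
    N'</tr></table></body></html>';

-- Style 2 enables limited internal-subset DTD processing, which is what
-- the error message asks for. Without it, the DOCTYPE is rejected.
DECLARE @x XML = CONVERT(XML, @html, 2);

-- Shred the converted document: one row per <td> element.
SELECT t.c.value('.', 'NVARCHAR(50)') AS [Value]
FROM @x.nodes('//td') AS t(c);
```

In the script above, the same CONVERT could presumably be applied to the column returned by OPENROWSET (aliased XMLDATA) inside the dynamic SQL, so that the conversion happens explicitly before the insert into XmlImportTest instead of implicitly during it.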