Looking for a C# program that downloads web pages through a SOCKS proxy

sbmzhcn

2014-10-23 21:08:08 +08:00

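One more requirement, the most important one, which I forgot: it has to support async. Otherwise the download can stall and never move again; I have run into this.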

dong3580

2014-10-23 21:48:42 +08:00

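Try looking at code for simulated logins; the scraping principle is similar. That said, I have never fetched pages through a proxy myself, so I don't know whether it would work.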

xenme

2014-10-23 21:55:52 +08:00

Search for a class for each piece separately, then stitch them together yourself.

mengskysama

2014-10-23 22:00:55 +08:00

Use curl and manage it with a thread pool you implement yourself. In .NET that is a dozen or so lines of code.

sbmzhcn

2014-10-23 22:03:48 +08:00

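If it were that simple I would not be asking. I have had this problem for over a year. I did get it working in code, but it is not stable, so I would like to learn from other people's code. Being told that a dozen lines will do it, all I can say is "hehe".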

mengskysama

2014-10-23 22:16:11 +08:00

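@sbmzhcn Why insist on writing it yourself? curl implements everything you need and it is open source; from .NET you only have to wrap it and call it. This really does not need async: just manage it with a thread pool of your own. There are also async wrappers written by other people on GitHub.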
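
One way to read the curl suggestion, as a rough sketch rather than the poster's actual code: instead of binding to libcurl, shell out to the `curl` command-line tool and let it speak SOCKS. This assumes `curl` is on the PATH, a SOCKS5 proxy at a host and port you supply, and .NET Core 2.1 or later (for `ProcessStartInfo.ArgumentList`). The names `CurlSocks`, `BuildArguments` and `Download` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class CurlSocks
{
    // Builds the curl argument list for a GET through a SOCKS5 proxy.
    public static List<string> BuildArguments(string url, string socksHost, int socksPort, int timeoutSeconds = 30)
    {
        return new List<string>
        {
            "--silent",                                        // no progress meter
            "--location",                                      // follow 301/302 redirects
            "--compressed",                                    // request and decode compressed bodies
            "--max-time", timeoutSeconds.ToString(),           // give up instead of hanging forever
            "--socks5-hostname", socksHost + ":" + socksPort,  // resolve DNS through the proxy too
            url
        };
    }

    // Runs curl and returns whatever it wrote to standard output.
    public static string Download(string url, string socksHost, int socksPort)
    {
        var psi = new ProcessStartInfo("curl")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        foreach (string arg in BuildArguments(url, socksHost, socksPort))
            psi.ArgumentList.Add(arg);

        using (Process process = Process.Start(psi))
        {
            string body = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("curl exited with code " + process.ExitCode);
            return body;
        }
    }
}
```

Calling `CurlSocks.Download` from thread-pool workers (for example via `Task.Run`) gives roughly the "curl plus a thread pool" arrangement described above. Note that the output is decoded with the process's default text encoding, so pages in other encodings would need extra handling.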

clijiac

2014-10-23 22:33:38 +08:00

Isn't the simplest option to put a SOCKS5-to-HTTP proxy in front? Try polipo.
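
A minimal sketch of the client side of this idea, under stated assumptions: polipo is already running locally, listening on its default HTTP port 8123, and forwarding to your SOCKS5 server through its `socksParentProxy` and `socksProxyType = socks5` settings. The C# code then only needs an ordinary HTTP proxy. `PolipoDownloader` and its method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class PolipoDownloader
{
    // Builds a handler that sends every request through a plain HTTP proxy.
    public static HttpClientHandler CreateHandler(string proxyHost, int proxyPort)
    {
        return new HttpClientHandler
        {
            Proxy = new WebProxy(proxyHost, proxyPort),
            UseProxy = true
        };
    }

    // Downloads a page as text through the proxy, with an overall timeout.
    public static async Task<string> DownloadAsync(string url, string proxyHost, int proxyPort)
    {
        using (var handler = CreateHandler(proxyHost, proxyPort))
        using (var client = new HttpClient(handler))
        {
            client.Timeout = TimeSpan.FromSeconds(30);
            return await client.GetStringAsync(url).ConfigureAwait(false);
        }
    }
}
```

With polipo on the same machine, a call would look like `await PolipoDownloader.DownloadAsync("http://example.com/", "127.0.0.1", 8123)`.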

takwai

2014-10-24 00:57:39 +08:00

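@sbmzhcn If the problem is a frozen UI, just start a new thread to handle the download. As for the multithreading support in your requirements, do you mean downloading the same source with multiple threads?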
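
Moving the download off the UI thread could look roughly like the helper below. It is only a sketch: it runs any blocking function on the thread pool and also stops waiting after a timeout, so a stalled download cannot block the caller forever. `BackgroundWork` and `RunWithTimeoutAsync` are hypothetical names, and `Task.Run` requires .NET 4.5 or later.

```csharp
using System;
using System.Threading.Tasks;

public static class BackgroundWork
{
    // Runs blocking work on a thread-pool thread and waits at most 'timeout' for it.
    public static async Task<T> RunWithTimeoutAsync<T>(Func<T> work, TimeSpan timeout)
    {
        Task<T> workTask = Task.Run(work);
        Task finished = await Task.WhenAny(workTask, Task.Delay(timeout)).ConfigureAwait(false);
        if (finished != workTask)
        {
            // The work itself is not cancelled here; it keeps running in the
            // background and its result is simply ignored.
            throw new TimeoutException("Work did not finish within " + timeout);
        }
        return await workTask.ConfigureAwait(false);
    }
}
```

From a UI event handler this would be awaited, for example `string html = await BackgroundWork.RunWithTimeoutAsync(() => DownloadPage(url), TimeSpan.FromSeconds(30));`, where `DownloadPage` stands for whatever blocking download routine you already have.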

The approach @mengskysama suggested does work; here is a simple implementation.

https://gist.github.com/takwai/871c59d9112133d2390c

sbmzhcn

2014-10-24 09:09:27 +08:00

```csharp
#region page source analytics
// Splits a raw HTTP response into headers and body, then decodes the body.
// Header, Headers, Cookies, StatusCode, ResponseUri, ContentType, ContentLength,
// KeepAlive, ChunkedDecompress and GzipDecompress are class members not shown here.
private void AnalyPageSource(MemoryStream ms,
    out WebHeaderCollection responseHeaders,
    out string pageContent)
{
    byte[] bytes = ms.ToArray();
    StringBuilder string_buffer = new StringBuilder();

    responseHeaders = new WebHeaderCollection();
    pageContent = string.Empty;
    // process headers
    for (int i = 0; i < bytes.Length; i++)
    {
        string_buffer.Append((char)bytes[i]);
        // Stop at the first blank line (\r\n\r\n), which ends the header block.
        // Checking only the last four bytes avoids rescanning the buffer on every byte.
        if (i >= 3 && bytes[i - 3] == '\r' && bytes[i - 2] == '\n'
            && bytes[i - 1] == '\r' && bytes[i] == '\n')
        {
            int index = i - 3;
            string originalheaderString = string_buffer.ToString().Substring(0, index);
            Header = originalheaderString;

            // Size the body buffer exactly so no trailing zero bytes reach the decoders.
            byte[] contentBytes = new byte[bytes.Length - index - 4];
            Array.Copy(bytes, index + 4, contentBytes, 0, contentBytes.Length);
            string originalPageContent = Encoding.UTF8.GetString(contentBytes);

            string responseState = "";
            string[] headers = originalheaderString.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string item in headers)
            {
                if (item.StartsWith("Set-Cookie:", StringComparison.OrdinalIgnoreCase))
                {
                    // Keep the first name=value pair (text up to and including the first ';', if any).
                    string tCookie = item.Substring(11, item.IndexOf(";") < 0 ? item.Length - 11 : item.IndexOf(";") - 10).Trim();
                    // Only add the cookie if no cookie with the same name is stored yet.
                    if (!this.Cookies.Exists(f => f.Split('=')[0] == tCookie.Split('=')[0]))
                    {
                        this.Cookies.Add(tCookie);
                    }
                }
                int colonIndex = item.IndexOf(":");
                if (colonIndex > -1)
                    responseHeaders.Add(item.Substring(0, colonIndex).Trim(), item.Substring(colonIndex + 1).Trim());
                else
                    responseState = item; // the status line has no colon
            }
            StatusCode = responseState.Split(' ')[1];

            // On a redirect, resolve Location as an absolute URI, falling back to a relative one.
            if (responseState.IndexOf(" 302 ") != -1 || responseState.IndexOf(" 301 ") != -1)
            {
                if (responseHeaders["Location"] != null)
                {
                    try { ResponseUri = new Uri(responseHeaders["Location"]); }
                    catch { ResponseUri = new Uri(ResponseUri, responseHeaders["Location"]); }
                }
            }

            ContentType = Headers["Content-Type"];
            if (Headers["Content-Length"] != null)
                ContentLength = int.Parse(Headers["Content-Length"]);
            KeepAlive = (Headers["Connection"] != null && Headers["Connection"].ToLower() == "keep-alive") ||
                (Headers["Proxy-Connection"] != null && Headers["Proxy-Connection"].ToLower() == "keep-alive");

            // Undo chunked transfer encoding first, then content encoding.
            try
            {
                if (!String.IsNullOrEmpty(responseHeaders[HttpResponseHeader.TransferEncoding]))
                {
                    if (responseHeaders[HttpResponseHeader.TransferEncoding].Contains("chunked"))
                    {
                        contentBytes = ChunkedDecompress(contentBytes);
                    }
                }
            }
            catch (Exception ex)
            {
                throw new Exception("Failed to process chunked data: " + ex.Message, ex);
            }
            try
            {
                if (!String.IsNullOrEmpty(responseHeaders[HttpResponseHeader.ContentEncoding]))
                {
                    if (responseHeaders[HttpResponseHeader.ContentEncoding].Contains("gzip")
                        || responseHeaders[HttpResponseHeader.ContentEncoding].Contains("deflate"))
                    {
                        pageContent = GzipDecompress(contentBytes);
                    }
                }
                else
                {
                    pageContent = Encoding.UTF8.GetString(contentBytes);
                }
            }
            catch (Exception ex)
            {
                throw new Exception("GzipDecompress Error: " + ex.Message, ex);
            }

            // Fall back to the undecoded body if decoding produced nothing.
            if (string.IsNullOrEmpty(pageContent) && !string.IsNullOrEmpty(originalPageContent))
            {
                pageContent = originalPageContent;
            }
            break;
        }
    }
}
#endregion
```
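
The `ChunkedDecompress` helper called above is not shown in the post. As a rough illustration of what that step involves, here is a minimal sketch of decoding a `Transfer-Encoding: chunked` body. It is not the poster's implementation: `ChunkedDecoder` is a hypothetical name, and the sketch assumes the whole body is already in memory and ignores any trailer headers.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class ChunkedDecoder
{
    // Decodes a complete chunked body: repeated "<hex size>\r\n<data>\r\n", ended by a zero-size chunk.
    public static byte[] Decode(byte[] body)
    {
        using (var output = new MemoryStream())
        {
            int pos = 0;
            while (pos < body.Length)
            {
                int lineEnd = IndexOfCrLf(body, pos);
                if (lineEnd < 0) throw new FormatException("Missing CRLF after chunk size");

                string sizeLine = Encoding.ASCII.GetString(body, pos, lineEnd - pos);
                int semicolon = sizeLine.IndexOf(';');          // drop optional chunk extensions
                if (semicolon >= 0) sizeLine = sizeLine.Substring(0, semicolon);
                int size = int.Parse(sizeLine.Trim(), NumberStyles.HexNumber, CultureInfo.InvariantCulture);

                pos = lineEnd + 2;                              // move past the size line
                if (size == 0) break;                           // last chunk; trailers are ignored
                if (pos + size > body.Length) throw new FormatException("Truncated chunk");

                output.Write(body, pos, size);
                pos += size + 2;                                // skip the data and its trailing CRLF
            }
            return output.ToArray();
        }
    }

    // Returns the index of the next "\r\n" at or after 'start', or -1 if there is none.
    private static int IndexOfCrLf(byte[] data, int start)
    {
        for (int i = start; i + 1 < data.Length; i++)
        {
            if (data[i] == (byte)'\r' && data[i + 1] == (byte)'\n') return i;
        }
        return -1;
    }
}
```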

This is part of my code. Why doesn't it feel that simple to me?

mengskysama

2014-10-24 11:30:48 +08:00

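@sbmzhcn Because you are implementing it yourself.... No programmer is that 'foolish' nowadays.

Implementing it yourself is fine too, and it does help a lot with understanding the HTTP protocol, but at the very least you would first have to read the several dozen RFCs on HTTP. If your goal is production use, I strongly advise against doing this.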

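Let me point out two problems in your code. First, your contentBytes is derived from the \r\n\r\n marker; check the documentation, but for content with a specified Content-Length this marker is not strictly required in front of the content. Second, the content is not necessarily encoded as UTF-8. So the code is bound to have problems as a general-purpose solution.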
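
On the second point, a minimal sketch of picking the decoder from the `charset` parameter of the `Content-Type` header instead of always assuming UTF-8. `CharsetDetector` is a hypothetical name; the sketch falls back to UTF-8 when the charset is missing or unknown and does not look at `<meta>` tags or byte-order marks.

```csharp
using System;
using System.Text;

public static class CharsetDetector
{
    // Returns the encoding named in a Content-Type value such as "text/html; charset=gbk".
    public static Encoding FromContentType(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string parameter = part.Trim();
                if (parameter.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = parameter.Substring("charset=".Length).Trim().Trim('"', '\'');
                    try
                    {
                        return Encoding.GetEncoding(name);
                    }
                    catch (ArgumentException)
                    {
                        // Unknown or unsupported charset name: use the default below.
                        // On .NET Core / .NET 5+, legacy code pages such as GBK need
                        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
                        break;
                    }
                }
            }
        }
        return Encoding.UTF8;
    }
}
```

In the posted method this would replace the hard-coded `Encoding.UTF8`, for example `pageContent = CharsetDetector.FromContentType(responseHeaders["Content-Type"]).GetString(contentBytes);`.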

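Also, your code is too messy. For the implementation you could look at the design of Python's urllib3, which is also fully open source and already has a complete implementation of HTTPS and connection-pool management; the code quality is quite high as well. That is why I never use .NET for this kind of thing....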

sbmzhcn

2014-10-28 13:03:18 +08:00

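@mengskysama First, I have to use .NET, because the requirement is to develop a C# program. Second, I don't want to implement it this way either, but there is a practical reason for it: this approach lets me use a SOCKS proxy, and I don't know of any way to make HttpWebRequest use a SOCKS proxy directly. I know very few people do it this way, which is exactly why I asked whether such code exists: because it is so rare. I do understand your point.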
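
For readers on a current runtime: .NET 6 and later accept a `socks5://` proxy address in the built-in HTTP stack, so this no longer requires hand-written sockets. That support postdates this thread and does not apply to the classic .NET Framework `HttpWebRequest`. A minimal sketch, assuming .NET 6+ and a SOCKS5 server at a host and port you supply; `SocksHttp` is a hypothetical name.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class SocksHttp
{
    // Builds a handler that tunnels requests through a SOCKS5 proxy (requires .NET 6 or later).
    public static SocketsHttpHandler CreateHandler(string socksHost, int socksPort)
    {
        return new SocketsHttpHandler
        {
            Proxy = new WebProxy(new Uri("socks5://" + socksHost + ":" + socksPort)),
            UseProxy = true,
            ConnectTimeout = TimeSpan.FromSeconds(15),
            // Let the handler undo gzip/deflate instead of decoding by hand.
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
    }

    // Downloads a page as text through the SOCKS5 proxy without blocking the caller.
    public static async Task<string> DownloadAsync(string url, string socksHost, int socksPort)
    {
        using (var handler = CreateHandler(socksHost, socksPort))
        using (var client = new HttpClient(handler))
        {
            client.Timeout = TimeSpan.FromSeconds(30);
            return await client.GetStringAsync(url).ConfigureAwait(false);
        }
    }
}
```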