Ubuntu - wget 下载工具

wget - "World Wide Web" 与 "get",是一个从网络上自动下载文件的自由工具,支持通过HTTP、HTTPS、FTP 三个常见的TCP/IP协议下载,并可以使用HTTP代理.

wget 可以跟踪 HTML 页面上的链接依次下载来创建远程服务器的本地版本,完全重建原始站点的目录结构. 这常称作 "递归下载".

wget 可以在下载时,同时将链接转换成指向本地文件,以方便离线浏览.

wget 非常稳定,即使在带宽很窄和网络不稳定时,适应性也很好.

如果是由于网络原因下载失败,wget 会不断的尝试,直到整个文件下载完毕.

如果是服务器打断下载过程,wget 会再次联到服务器上从停止的地方继续下载(断点下载). 对于限定了链接时间的服务器上下载大文件非常有用.

wget 功能强大,且使用比较简单.

wget命令

1. 常见 wget 命令

wget 用法:

wget(选项)(URL)

常用 (选项): (大小写敏感)

-a<日志文件>:在指定的日志文件中记录资料的执行过程; 
            –append-output=FILE 把记录追加到FILE文件中;
-A<后缀名>:指定要下载文件的后缀名,多个后缀名之间使用逗号进行分隔;
-b:进行后台的方式运行wget;
-B<连接地址>:设置参考的连接地址的基地地址;
-c:继续执行上次终端的任务;断点续传
-C<标志>:设置服务器数据块功能标志on为激活,off为关闭,默认值为on;
-d:调试模式运行指令;
-D<域名列表>:设置顺着的域名列表,域名之间用“,”分隔;
-e<指令>:作为文件“.wgetrc”中的一部分执行指定的指令;
-h:显示指令帮助信息;
-i<文件>:从指定文件获取要下载的URL地址;
-k:表示将下载的网页里的链接修改为本地链接;
-l<目录列表>:设置顺着的目录列表,多个目录用“,”分隔;
-L:仅顺着关联的连接;
-m:–mirror 等价于 -r -N -l inf -nr;
-nc:文件存在时,下载文件不覆盖原有文件;
-nd:递归下载时不创建一层一层的目录,把所有的文件下载到当前目录;
-nh:不查询主机名称;
-np:不要追溯到父目录
-nv:下载时只显示更新和出错信息,不显示指令的详细执行过程;
-N:不要重新下载文件除非比本地文件新;
-o:将log日志指定保存到文件(新建一个文件);
-p:获得所有显示网页所需的元素;
-P:指定下载目录;
-q:不显示指令执行过程;
-r:递归下载;
-v:显示详细执行过程;
-V:显示版本信息;
--passive-ftp:使用被动模式PASV连接FTP服务器;
--follow-ftp:从HTML文件中下载FTP连接文件。

2. 实例

2.1 下载单个文件

wget http://www.linuxde.net/testfile.zip

从网络下载一个文件并保存在当前目录,在下载的过程中会显示进度条,包含(下载完成百分比,已经下载的字节,当前下载速度,剩余下载时间).

2.2 下载并以不同的文件名保存

wget -O wordpress.zip http://www.linuxde.net/download.aspx?id=1080

wget 默认会以最后一个符合/的后面的字符来命令,对于动态链接的下载通常文件名会不正确.

错误:下面的例子会下载一个文件并以名称download.aspx?id=1080保存:

wget http://www.linuxde.net/download?id=1

即使下载的文件是zip格式,它仍然以download.php?id=1080命令.

正确:为了解决这个问题,我们可以使用参数-O来指定一个文件名:

wget -O wordpress.zip http://www.linuxde.net/download.aspx?id=1080

2.3 限速下载

wget --limit-rate=300k http://www.linuxde.net/testfile.zip

wget 默认会占用全部可能的宽带下载.

但是当准备下载一个大文件,而还需要下载其它文件时就有必要限速了.

2.4 断点续传

wget -c http://www.linuxde.net/testfile.zip

使用wget -c重新启动下载中断的文件,有利于下载大文件.

当网络等原因突然中断下载时,可以继续接着下载而不是重新下载一个文件.

2.5 后台下载

wget -b http://www.linuxde.net/testfile.zip

Continuing in background, pid 1840.
Output will be written to `wget-log'.

对于下载非常大的文件的时候,可以使用参数-b进行后台下载,可以使用以下命令来察看下载进度:

tail -f wget-log

2.6 伪装代理名称下载

wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16" \
    http://www.linuxde.net/testfile.zip

有些网站能通过根据判断代理名称不是浏览器而拒绝下载请求.

不过可以通过--user-agent参数伪装.

2.7 测试下载链接

当打算进行定时下载,应该在预定时间测试下载链接是否有效.

可以增加--spider参数进行检查.

wget --spider URL 

如果下载链接正确,将会显示:

Spider mode enabled. Check if remote file exists.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.

这保证了下载能在预定的时间进行,但当给错了一个链接时,将会显示如下错误:

wget --spider url
Spider mode enabled. Check if remote file exists.
HTTP request sent, awaiting response... 404 Not Found
Remote file does not exist -- broken link!!!

可以在以下几种情况下使用--spider参数:

  • 定时下载之前进行检查
  • 间隔检测网站是否可用
  • 检查网站页面的死链接

2.8 增加重试次数

wget --tries=40 URL

如果网络有问题或下载一个大文件也有可能失败.

wget默认重试20次连接下载文件.

如果需要,可以使用--tries增加重试次数.

2.9 下载多个文件

wget -i filelist.txt

首先,保存一份下载链接文件:

cat > filelist.txt
url1
url2
url3
url4

接着使用这个文件和参数-i下载。

2.10 镜像网站

wget --mirror -p --convert-links -P ./LOCAL URL

下载整个网站到本地.

  • --miror开户镜像下载.
  • -p下载所有为了html页面显示正常的文件.
  • --convert-links下载后,转换成本地的链接.
  • -P ./LOCAL保存所有文件和目录到本地指定目录.

2.11 过滤指定格式下载

wget --reject=gif ur

下载一个网站,但不希望下载图片,可以使用这条命令.

2.12 把下载信息存入日志文件

wget -o download.log URL

不希望下载信息直接显示在终端而是在一个日志文件,可以使用.

2.13 限制总下载文件大小

wget -Q5m -i filelist.txt

当想要下载的文件超过 5M 而退出下载.

注意:这个参数对单个文件下载不起作用,只能递归下载时才有效.

2.14 下载指定格式文件

wget -r -A.pdf url

可以在以下情况使用该功能:

  • 下载一个网站的所有图片.
  • 下载一个网站的所有视频.
  • 下载一个网站的所有PDF文件.

2.15 FTP下载

wget ftp-url
wget --ftp-user=USERNAME --ftp-password=PASSWORD url

可以使用wget来完成ftp链接的下载.

使用wget匿名ftp下载:

wget ftp-url

使用 wget 用户名和密码认证的ftp下载:

wget --ftp-user=USERNAME --ftp-password=PASSWORD url

3. wget -h 全部命令

wget -h # 显示帮助选项,输出可选选项
Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -a,  --append-output=FILE        append messages to FILE
  -d,  --debug                     print lots of debugging information
  -q,  --quiet                     quiet (no output)
  -v,  --verbose                   be verbose (this is the default)
  -nv, --no-verbose                turn off verboseness, without being quiet
       --report-speed=TYPE         output bandwidth as TYPE.  TYPE can be bits
  -i,  --input-file=FILE           download URLs found in local or external FILE
  -F,  --force-html                treat input file as HTML
  -B,  --base=URL                  resolves HTML input-file links (-i -F)
                                     relative to URL
       --config=FILE               specify config file to use
       --no-config                 do not read any config file
       --rejected-log=FILE         log reasons for URL rejection to FILE

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits)
       --retry-connrefused         retry even if connection is refused
  -O,  --output-document=FILE      write documents to FILE
  -nc, --no-clobber                skip downloads that would download to
                                     existing files (overwriting them)
  -c,  --continue                  resume getting a partially-downloaded file
       --start-pos=OFFSET          start downloading from zero-based position OFFSET
       --progress=TYPE             select progress gauge type
       --show-progress             display the progress bar in any verbosity mode
  -N,  --timestamping              don't re-retrieve files unless newer than
                                     local
       --no-if-modified-since      don't use conditional if-modified-since get
                                     requests in timestamping mode
       --no-use-server-timestamps  don't set the local file's timestamp by
                                     the one on the server
  -S,  --server-response           print server response
       --spider                    don't download anything
  -T,  --timeout=SECONDS           set all timeout values to SECONDS
       --dns-timeout=SECS          set the DNS lookup timeout to SECS
       --connect-timeout=SECS      set the connect timeout to SECS
       --read-timeout=SECS         set the read timeout to SECS
  -w,  --wait=SECONDS              wait SECONDS between retrievals
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals
       --no-proxy                  explicitly turn off proxy
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host
       --limit-rate=RATE           limit download rate to RATE
       --no-dns-cache              disable caching DNS lookups
       --restrict-file-names=OS    restrict chars in file names to ones OS allows
       --ignore-case               ignore case when matching files/directories
  -4,  --inet4-only                connect only to IPv4 addresses
  -6,  --inet6-only                connect only to IPv6 addresses
       --prefer-family=FAMILY      connect first to addresses of specified family,
                                     one of IPv6, IPv4, or none
       --user=USER                 set both ftp and http user to USER
       --password=PASS             set both ftp and http password to PASS
       --ask-password              prompt for passwords
       --no-iri                    turn off IRI support
       --local-encoding=ENC        use ENC as the local encoding for IRIs
       --remote-encoding=ENC       use ENC as the default remote encoding
       --unlink                    remove file before clobber

Directories:
  -nd, --no-directories            don't create directories
  -x,  --force-directories         force creation of directories
  -nH, --no-host-directories       don't create host directories
       --protocol-directories      use protocol name in directories
  -P,  --directory-prefix=PREFIX   save files to PREFIX/..
       --cut-dirs=NUMBER           ignore NUMBER remote directory components

HTTP options:
       --http-user=USER            set http user to USER
       --http-password=PASS        set http password to PASS
       --no-cache                  disallow server-cached data
       --default-page=NAME         change the default page name (normally
                                     this is 'index.html'.)
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions
       --ignore-length             ignore 'Content-Length' header field
       --header=STRING             insert STRING among the headers
       --max-redirect              maximum redirections allowed per page
       --proxy-user=USER           set USER as proxy username
       --proxy-password=PASS       set PASS as proxy password
       --referer=URL               include 'Referer: URL' header in HTTP request
       --save-headers              save the HTTP headers to file
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections)
       --no-cookies                don't use cookies
       --load-cookies=FILE         load cookies from FILE before session
       --save-cookies=FILE         save cookies to FILE after session
       --keep-session-cookies      load and save session (non-permanent) cookies
       --post-data=STRING          use the POST method; send STRING as the data
       --post-file=FILE            use the POST method; send contents of FILE
       --method=HTTPMethod         use method "HTTPMethod" in the request
       --body-data=STRING          send STRING as data. --method MUST be set
       --body-file=FILE            send contents of FILE. --method MUST be set
       --content-disposition       honor the Content-Disposition header when
                                     choosing local file names (EXPERIMENTAL)
       --content-on-error          output the received content on server errors
       --auth-no-challenge         send Basic HTTP authentication information
                                     without first waiting for the server's
                                     challenge

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2,
                                     SSLv3, TLSv1 and PFS
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate
       --certificate=FILE          client certificate file
       --certificate-type=TYPE     client certificate type, PEM or DER
       --private-key=FILE          private key file
       --private-key-type=TYPE     private key type, PEM or DER
       --ca-certificate=FILE       file with the bundle of CAs
       --ca-directory=DIR          directory where hash list of CAs is stored
       --crl-file=FILE             file with bundle of CRLs
       --random-file=FILE          file with random data for seeding the SSL PRNG
       --egd-file=FILE             file naming the EGD socket with random data

HSTS options:
       --no-hsts                   disable HSTS
       --hsts-file                 path of HSTS database (will override default)

FTP options:
       --ftp-user=USER             set ftp user to USER
       --ftp-password=PASS         set ftp password to PASS
       --no-remove-listing         don't remove '.listing' files
       --no-glob                   turn off FTP file name globbing
       --no-passive-ftp            disable the "passive" transfer mode
       --preserve-permissions      preserve remote file permissions
       --retr-symlinks             when recursing, get linked-to files (not dir)

FTPS options:
       --ftps-implicit                 use implicit FTPS (default port is 990)
       --ftps-resume-ssl               resume the SSL/TLS session started in the control connection when
                                         opening a data connection
       --ftps-clear-data-connection    cipher the control channel only; all the data will be in plaintext
       --ftps-fallback-to-ftp          fall back to FTP if FTPS is not supported in the target server
WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file
       --warc-header=STRING        insert STRING into the warcinfo record
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER
       --warc-cdx                  write CDX index files
       --warc-dedup=FILENAME       do not store records listed in this CDX file
       --no-warc-compression       do not compress WARC files with GZIP
       --no-warc-digests           do not calculate SHA1 digests
       --no-warc-keep-log          do not store the log file in a WARC record
       --warc-tempdir=DIRECTORY    location for temporary files created by the
                                     WARC writer

Recursive download:
  -r,  --recursive                 specify recursive download
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite)
       --delete-after              delete files locally after downloading them
  -k,  --convert-links             make links in downloaded HTML or CSS point to
                                     local files
       --convert-file-only         convert the file part of the URLs only (usually known as the basename)
       --backups=N                 before writing file X, rotate up to N backup files
  -K,  --backup-converted          before converting file X, back up as X.orig
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing
  -p,  --page-requisites           get all images, etc. needed to display HTML page
       --strict-comments           turn on strict (SGML) handling of HTML comments

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions
  -R,  --reject=LIST               comma-separated list of rejected extensions
       --accept-regex=REGEX        regex matching accepted URLs
       --reject-regex=REGEX        regex matching rejected URLs
       --regex-type=TYPE           regex type (posix|pcre)
  -D,  --domains=LIST              comma-separated list of accepted domains
       --exclude-domains=LIST      comma-separated list of rejected domains
       --follow-ftp                follow FTP links from HTML documents
       --follow-tags=LIST          comma-separated list of followed HTML tags
       --ignore-tags=LIST          comma-separated list of ignored HTML tags
  -H,  --span-hosts                go to foreign hosts when recursive
  -L,  --relative                  follow relative links only
  -I,  --include-directories=LIST  list of allowed directories
       --trust-server-names        use the name specified by the redirection
                                     URL's last component
  -X,  --exclude-directories=LIST  list of excluded directories
  -np, --no-parent                 don't ascend to the parent directory
Last modification:December 14th, 2018 at 03:30 pm

Leave a Comment