
Get http directory size

PostPosted: Jul 18th, '11, 08:06
by msdobrescu
Hello,

Assuming I have access to a directory shared over HTTP (with directory listing enabled), how could I determine its total size, including all subdirectories?
Is there a tool or a script?

Thank you,
Mike

Re: Get http directory size

PostPosted: Aug 4th, '11, 00:56
by dotmil
I don't know of any way to do this easily or reliably over HTTP. If you have SSH or shell access, you can use the du command:
http://www.linfo.org/du.html
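For example (a minimal sketch; the path is just a placeholder), du -sh prints a human-readable total for a directory and everything below it:
Code:
du -sh /path/to/directory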

Re: Get http directory size

PostPosted: Aug 4th, '11, 06:22
by msdobrescu
I don't have shell access.
A script that recursively fetches the file listing and sums the file sizes would be good.

Re: Get http directory size

PostPosted: Aug 4th, '11, 09:48
by doktor5000
Well, as a workaround you can mirror the directory with wget -mk and then measure the size of the local copy; that can easily be scripted.
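A minimal sketch of that workaround (the URL is a placeholder; wget -m mirrors into a local directory named after the host):
Code:
# Mirror the remote directory, converting links for local use
wget -mk http://some-web-page/dir/
# Measure the size of the local copy
du -sh some-web-page/dir/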

Re: Get http directory size

PostPosted: Aug 4th, '11, 09:51
by msdobrescu
The idea is to estimate the size before I download it.
What if it's 250 GB?

Re: Get http directory size

PostPosted: Aug 4th, '11, 23:36
by noneco
You could wget the directory with --spider -S and sum up the Content-Length headers.
This should work:
Code:
# Sum the Content-Length headers reported by wget in spider mode
sum_bytes=0
for content_length in $(wget --spider -S -q http://some-web-page/ 2>&1 | grep Content-Length: | grep -o "[0-9]\+")
do
    sum_bytes=$(($sum_bytes + $content_length))
done
# Print the total in MiB, with two decimal places
echo "scale=2; $sum_bytes/1024/1024" | bc

If you add -r and -l, wget will recurse and you can limit the depth.

Re: Get http directory size

PostPosted: Aug 4th, '11, 23:52
by doktor5000
The wget manpage says the --spider function needs much more work to behave like a real spider,
and also that the Content-Length provided by some web servers is sometimes bogus and makes wget go wild:

man wget wrote:--ignore-length
Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus "Content-Length" headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.

With this option, Wget will ignore the "Content-Length" header---as if it never existed.


But I don't really know; if you say it works, that would be the solution to the problem.

Re: Get http directory size

PostPosted: Aug 5th, '11, 00:03
by noneco
I just tested it, but you are right, it may not always work.
Code:
[noneco@nyra2 ~]$ sum_bytes=0
[noneco@nyra2 ~]$ for content_length in $(wget --spider -S -q -r -l2 http://ftp.mandrivauser.de/magazin/ 2>&1 | grep Content-Length: | grep -o "[0-9]\+")
> do
> sum_bytes=$(($sum_bytes+$content_length))
> done
[noneco@nyra2 ~]$ echo "scale=2; $sum_bytes/1024/1024 " | bc
456.02

Re: Get http directory size

PostPosted: Aug 5th, '11, 00:46
by dotmil
Another thing to take into account is the server you are spidering. Some software (and hardware) firewalls will temporarily block your IP if you exceed a specified number of TCP requests in a given time frame, or permanently block you if you trigger the limit repeatedly. I have personally seen some that are overly restrictive and allow only, say, 8-10 requests per 4-5 seconds. Not many are that tight, but 15-20 connections in a 5-second period seems to be at the strict end of normal. Of course, from an end-user perspective there is no way to see this until it happens to you.
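If rate limits are a concern, wget's standard --wait and --random-wait options can space out the requests; for example, a variation on the spider command above (the URL is a placeholder):
Code:
# Wait roughly one second (randomized) between requests to avoid tripping rate limits
wget --spider -S -q -r -l2 --wait=1 --random-wait http://some-web-page/ 2>&1 | grep Content-Length: | grep -o "[0-9]\+"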

Re: Get http directory size

PostPosted: Sep 16th, '11, 18:58
by msdobrescu
Thank you, guys; in my case the script works brilliantly.
I am a total newbie in sh.
Is there a way to make it a script that can be run like:

Code:
script.sh http://sample/


Also, how could I run some command from a path relative to the script?

Thank you,
Mike
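
For reference, a minimal sketch of such a wrapper around the script above (the script name, URL, and "somecommand" are placeholders; "$1" is the first command-line argument and dirname "$0" gives the directory the script lives in):
Code:
#!/bin/sh
# Usage: ./script.sh http://sample/
url="$1"                      # first command-line argument: the URL to measure
script_dir=$(dirname "$0")    # directory the script itself lives in
# A program sitting next to the script could be run as: "$script_dir"/somecommand

sum_bytes=0
for content_length in $(wget --spider -S -q -r -l2 "$url" 2>&1 | grep Content-Length: | grep -o "[0-9]\+")
do
    sum_bytes=$(($sum_bytes + $content_length))
done
echo "scale=2; $sum_bytes/1024/1024" | bc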