Get http directory size


Get http directory size

Postby msdobrescu » Jul 18th, '11, 08:06

Hello,

Assuming I have access to a directory shared over HTTP (with directory listing enabled), how could I compute its total size recursively?
Is there a tool or a script for that?

Thank you,
Mike
msdobrescu
 
Posts: 213
Joined: Jun 2nd, '11, 07:28

Re: Get http directory size

Postby dotmil » Aug 4th, '11, 00:56

I don't know of any way to easily or reliably do this over http. If you have SSH or shell access you can use the du command:
http://www.linfo.org/du.html
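For example, a minimal sketch (here "." stands in for the real directory, since this only works with shell access on the machine that hosts the files, not over plain HTTP):

```shell
# du sums on-disk usage recursively.
du -sh .                       # total size, human-readable (-s summary, -h human units)
size_kb=$(du -sk . | cut -f1)  # total in KiB, handy for use in scripts
echo "total: ${size_kb} KiB"
```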
dotmil
 
Posts: 22
Joined: Jul 1st, '11, 17:05
Location: Texas, USA

Re: Get http directory size

Postby msdobrescu » Aug 4th, '11, 06:22

I don't.
A script that recursively gets the file list and sums the sizes would be good.
msdobrescu
 
Posts: 213
Joined: Jun 2nd, '11, 07:28

Re: Get http directory size

Postby doktor5000 » Aug 4th, '11, 09:48

Well, as a workaround you can mirror the directory with wget -mk and then measure the local directory size; that can be easily scripted.
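Something like this sketch, wrapped in a function for scripting (the function name and the URL are placeholders; note this downloads the full content, so it only makes sense when that is acceptable):

```shell
# Workaround: mirror first, then measure the local copy.
mirror_and_measure() {
    url=$1
    # -m: mirror (recursion + timestamping), -k: convert links for local viewing
    wget -mk "$url"
    # wget saves under a directory named after the host, so strip the scheme:
    du -sh "${url#*://}"
}
# Usage (not run here): mirror_and_measure http://some-web-page/
```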
Cauldron is not for the faint of heart!
Caution: Hot, bubbling magic inside. May explode or cook your kittens!
----
Disclaimer: Beware of allergic reactions in answer to unconstructive complaint-type posts
doktor5000
 
Posts: 18051
Joined: Jun 4th, '11, 10:10
Location: Leipzig, Germany

Re: Get http directory size

Postby msdobrescu » Aug 4th, '11, 09:51

The idea is to estimate before I download it.
What if it has 250 GB?
msdobrescu
 
Posts: 213
Joined: Jun 2nd, '11, 07:28

Re: Get http directory size

Postby noneco » Aug 4th, '11, 23:36

You could wget the directory using --spider -S and sum up the Content-Length headers.
This should work:
Code:
sum_bytes=0
# --spider: check without downloading; -S: print the server response headers.
# The headers go to stderr, hence the 2>&1 redirect before grep.
for content_length in $(wget --spider -S -q http://some-web-page/ 2>&1 | grep "Content-Length:" | grep -o "[0-9]\+")
do
    sum_bytes=$((sum_bytes + content_length))
done
# Convert the byte total to MiB with two decimal places.
echo "scale=2; $sum_bytes/1024/1024" | bc

If you use -r and -l, you can check recursively and define the depth.
noneco
 
Posts: 3
Joined: Aug 4th, '11, 23:28

Re: Get http directory size

Postby doktor5000 » Aug 4th, '11, 23:52

The wget manpage says the --spider feature needs much more work to behave like a real spider,
and also that the Content-Length some web servers provide is sometimes bogus and makes wget go wild:

man wget wrote:--ignore-length
Unfortunately, some HTTP servers ( CGI programs, to be more precise) send out bogus "Content-Length" headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.

With this option, Wget will ignore the "Content-Length" header---as if it never existed.


But I don't really know; if you say it works, that would be the solution to the problem.
Cauldron is not for the faint of heart!
Caution: Hot, bubbling magic inside. May explode or cook your kittens!
----
Disclaimer: Beware of allergic reactions in answer to unconstructive complaint-type posts
doktor5000
 
Posts: 18051
Joined: Jun 4th, '11, 10:10
Location: Leipzig, Germany

Re: Get http directory size

Postby noneco » Aug 5th, '11, 00:03

I just tested it, but you are right, it may not always work.
Code:
[noneco@nyra2 ~]$ sum_bytes=0
[noneco@nyra2 ~]$ for content_length in $(wget --spider -S -q -r -l2 http://ftp.mandrivauser.de/magazin/ 2>&1 | grep Content-Length: | grep -o "[0-9]\+")
> do
> sum_bytes=$(($sum_bytes+$content_length))
> done
[noneco@nyra2 ~]$ echo "scale=2; $sum_bytes/1024/1024 " | bc
456.02
noneco
 
Posts: 3
Joined: Aug 4th, '11, 23:28

Re: Get http directory size

Postby dotmil » Aug 5th, '11, 00:46

Another thing to take into account is the server you are spidering. There are software (and hardware) firewalls that will temporarily block your IP if you hit a specified number of TCP requests in a specified time frame, or permanently block you if you trigger them multiple times. I have personally seen some that are overly restrictive and limit you to, say, 8-10 requests in a 4-5 second period. I'm sure not many are that tight, but 15-20 or so connections in a 5 second period seems to be on the strict end of normal. Of course, from an end-user perspective there's really no way to see this until after it happens to you.
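One way to stay under such limits is to let wget pause between requests: --wait sets a delay between retrievals and --random-wait varies that delay so the request pattern looks less mechanical. A sketch combining this with the spider approach above (function name and URL are placeholders):

```shell
# Spider recursively, but throttled: 1 second (randomized) between requests.
throttled_spider() {
    url=$1
    wget --spider -S -q -r -l2 --wait=1 --random-wait "$url" 2>&1 |
        grep "Content-Length:" | grep -o "[0-9]\+"
}
# Usage (not run here): throttled_spider http://some-web-page/
```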
dotmil
 
Posts: 22
Joined: Jul 1st, '11, 17:05
Location: Texas, USA

Re: Get http directory size

Postby msdobrescu » Sep 16th, '11, 18:58

Thank you guys, in my case the script works brilliantly.
I am a total newbie in sh.
Is there a way to make a script run like:

Code:
script.sh http://sample/


Also, how could I run a command from a path relative to the script's location?
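For reference, plain sh covers both: "$1" holds the first command-line argument, and dirname "$0" yields the directory the script was invoked from. A minimal sketch (the helper name is hypothetical):

```shell
#!/bin/sh
# script.sh -- invoked as: script.sh http://sample/
url=$1                        # first positional argument
script_dir=$(dirname "$0")    # directory containing this script
echo "URL argument: $url"
echo "Script directory: $script_dir"
# A command sitting next to the script could then be run as:
#   "$script_dir/helper.sh" "$url"
```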

Thank you,
Mike
msdobrescu
 
Posts: 213
Joined: Jun 2nd, '11, 07:28

