( ESNUG 311 Item 2 ) ---------------------------------------------- [2/18/99]
Subject: (ESNUG 308 #7 310 #11) Wait! Our Experience With LSF Was Great!
> I was forced to use a Perl script (written by a fellow graduate student) to
> check the number of jobs I was running and submit jobs as old jobs
> completed. It was not convenient at all.
>
> I also found that some of my jobs would continue running for more than a
> full day. When I killed these jobs they would often report results as if
> they had finished normally. I don't know how long they would have remained
> in limbo if I had not killed them manually.
>
> All in all, I was rather disappointed with LSF. Maybe I expect too much
> functionality without a great deal of work. And, as I said, I cannot
> guarantee that the system was installed and maintained properly.
>
> - David C. Hoffmeister
> University of Maryland
From: Tanya Pobuda <tpobuda@platform.com>
John,
Just a quick note to let you know that Platform's account manager has just
been in contact with the Naval Research Lab. There's a strong feeling there
that this is a misconfiguration or maintenance issue, and our rep and her
support team are working to solve it.
We've got a second call in to pinpoint the problem. I will forward a note to
let you know how it turns out.
Thanks for the tip.
- Tanya Pobuda
Platform Computing Corp.
---- ---- ---- ---- ---- ---- ----
From: Billy Vitro <bvitro@cisco.com>
John,
The LSF software was almost certainly configured completely wrong. The
main feature of LSF is to load share by sending jobs to CPUs that are not
currently being used. It will even monitor machines which are loaded by
jobs run outside the LSF system, providing the jobs were started prior to
LSF submitting the batch job to the host.
This user may have been experiencing something that we see ourselves:
nothing in LSF prevents users from running jobs outside of LSF on the same
hosts and stealing CPU cycles after the batch job has been started. We got around
that by restricting logins on server farm machines, which keeps people from
bogging them down with non-batch jobs, and allows the LSF software to do
its job and use the resources most efficiently.
- Billy Vitro
Cisco Systems
---- ---- ---- ---- ---- ---- ----
From: Tom Loftus <tloftus@hns.com>
John,
I have to respond to David Hoffmeister's comments regarding LSF. As he
himself suggested in his e-mail, it sounds to me like it was not configured
properly to match his needs.
I have had very positive results with LSF and praise it very highly,
particularly for ASIC regression testing. However, it is an extremely
configurable tool and requires close cooperation between the administrators
of the tool and the users.
Our biggest problem has been coming up with a "use model" which can be
captured in the LSF config files and implemented on the machines with the
desired results. This is complicated by the fact that we allow both batch
and interactive use of the servers. With that said, here are some specific
responses from a user of LSF versions 3.1 and 3.2.
> Often my jobs were submitted to hosts that had more than one job per
> cpu already while other hosts had multiple idle cpus.
The number of jobs per CPU is a configurable option. When he says "jobs"
I would be curious to see if these were LSF submitted jobs, or jobs running
outside the LSF system. LSF software can't control jobs not submitted
through LSF.
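
To make the "jobs per CPU" idea concrete, here is a minimal Python sketch of
slot-limited host selection. It is an illustration only, not LSF's actual
scheduler or configuration syntax: the Host class, the pick_host function,
and the load formula are all hypothetical names and assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical view of one server as a batch scheduler might see it."""
    name: str
    cpus: int
    batch_jobs: int = 0        # jobs placed here by the scheduler
    outside_load: float = 0.0  # load from processes started outside the scheduler

    def free_slots(self, jobs_per_cpu: int) -> int:
        # The slot limit only counts jobs the scheduler itself placed.
        return self.cpus * jobs_per_cpu - self.batch_jobs


def pick_host(hosts: List[Host], jobs_per_cpu: int = 1) -> Optional[Host]:
    """Return the host with a free slot and the lowest load per CPU, or None."""
    candidates = [h for h in hosts if h.free_slots(jobs_per_cpu) > 0]
    if not candidates:
        return None  # every host is at its slot limit; the job must wait
    # Assumed load measure: batch jobs plus outside load, divided by CPU count.
    return min(candidates, key=lambda h: (h.batch_jobs + h.outside_load) / h.cpus)


hosts = [
    Host("a", cpus=2, batch_jobs=2),     # full at one job per CPU
    Host("b", cpus=4, batch_jobs=1),     # lightly loaded
    Host("c", cpus=2, outside_load=3.0), # busy with non-batch work
]
chosen = pick_host(hosts)
print(chosen.name if chosen else "no host available")
```

Note that load which appears after a job has been placed is not covered by
this sketch; the scheduler can only take into account what it observes at
dispatch time.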
> On top of that, if I submitted too many jobs at once, even with resource
> requirements, it allowed all of them to run and swamped the system.
We have limits on the number of jobs per user set up in a set of queues so
that a user can submit to the queue with the desired amount of parallelism.
Our biggest limit is licenses. We can't let the LSF software chew up all
the Verilog licenses or other users can't work.
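
The effect of a per-user job limit combined with a license cap can be
sketched roughly as follows. This is a toy Python model and not LSF
configuration or LSF's API: the JobQueue class and its method names are
hypothetical, and it assumes every job consumes exactly one license.

```python
from collections import Counter, deque


class JobQueue:
    """Toy queue: a job starts only if its user is under the per-user
    limit and a license is free (assumption: one license per job)."""

    def __init__(self, per_user_limit, licenses):
        self.per_user_limit = per_user_limit
        self.free_licenses = licenses
        self.running = Counter()  # user -> number of running jobs
        self.pending = deque()    # (user, job) pairs waiting to start

    def submit(self, user, job):
        self.pending.append((user, job))
        return self.dispatch()

    def finish(self, user):
        # A job completed: release its slot and its license.
        self.running[user] -= 1
        self.free_licenses += 1
        return self.dispatch()

    def dispatch(self):
        """Start every pending job that fits; return the jobs started."""
        started = []
        still_waiting = deque()
        while self.pending:
            user, job = self.pending.popleft()
            if self.free_licenses > 0 and self.running[user] < self.per_user_limit:
                self.running[user] += 1
                self.free_licenses -= 1
                started.append((user, job))
            else:
                still_waiting.append((user, job))
        self.pending = still_waiting
        return started


q = JobQueue(per_user_limit=2, licenses=3)
for user, job in [("ann", "sim1"), ("ann", "sim2"), ("ann", "sim3"),
                  ("bob", "sim4"), ("bob", "sim5")]:
    print(user, job, "started" if q.submit(user, job) else "pending")
print("after ann finishes one job:", q.finish("ann"))
```

Submitting to a queue with a smaller per_user_limit gives less parallelism;
the license count caps the total no matter how many users submit.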
> Maybe I expect too much functionality without a great deal of work. And,
> as I said, I cannot guarantee that the system was installed and maintained
> properly.
I am both the administrator and a user of the LSF software, so I am in a
possibly unusual position to tweak it to our needs. But my
experience has shown that if you can describe a set of rules, or use model,
to fit your situation, you can reliably implement it in LSF.
Also, there will always be some cases where a person could make better
decisions than the software following its simple rules. However, I
maintain that the benefits far outweigh the occasional inefficiencies.
One last comment: I wish LSF integrated better with FLEXlm license servers.
I would like the flexibility to suspend jobs, steal their licenses, and
then resume them, but I can't do that now.
- Tom Loftus
Hughes Network Systems