"image"
To learn more about "image" you will find the overview page here.
Discussion during EuroBSDCon hackathon 2007
1. Panellists (unordered)
PawelJakubDawidek, IsaacLevy, BrooksDavis, PhilipPaeps, MarkoZec, SimonNielsen, BjoernZeeb
2. Notes by IsaacLevy
2.1. Overview of "Image"/Jail meeting
A number of projects are directly related to Jail in the context of virtual servers, yet have more valuable roles and futures on their own. As PHK suggested in NotesEuroBSDConDevsummit2006, it was discussed that jail should migrate to "image" (a system image). However, a design strategy that keeps the various components separated stops jail/image from turning into too much of a singularity or tangent in the OS, and makes the valuable aspects of each component available to the system for use in other contexts and applications.
The following projects have direct impact on jailing:
- Marko's work: TCP/IP virtualization
- Pawel's work: ZFS/filesystem
- Brooks' work: Resource Control/Job control (post-meeting note added below)
Jail Cleanup:
- get rid of jid
- go with jail names (unique)
2.2. Discussion of a few problems with current jail implementation:
2.2.1. jails do not always stop and restart immediately
- hang from credential structure
- TCP timeout code...
- Jail will persist until timeout ends (also in devfs code?!?)
- bug in TTY code
- lookup: devfs makes an entry, using the credential of the process doing the lookup
- devfs will never remove this entry
- THE JAIL WILL NEVER GO AWAY (a killed jail persisting in jls(8) is most visible example)
- bug in TTY code
- Possible solutions?
- always create tty with kernel credentials
- jails with unique names (performance problem with new jails? [scanning names, collision detection])
simply use a faster data structure if this becomes a problem
- Like separate exec syscall, (pre-populate jails)
- No chroot with open directory descriptor, but with open files OK.
- Would be nice to create empty jails and execute binary inside the jail
- Jail name (separate from hostname)
NO trust for hostname
- (never been a good strategy, in the 4.x era or the 5-6.x era; in real-world administration jail hostnames change often, and jailed users may simply need to change the hostname)
sidenote: Pawel has patch for jail within a jail http://garage.freebsd.pl/mljail.README
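The "faster data structure" remark above could be sketched roughly as follows: a small chained hash table keyed by jail name, so that the uniqueness check at jail creation does not have to scan every existing jail. This is a userland illustration only, not the kernel's implementation; every name in it (`jail_name_table`, `jail_name_register`, the bucket count) is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define JAIL_NAME_BUCKETS 64	/* hypothetical table size */

struct jail_name_entry {
	char *name;
	struct jail_name_entry *next;
};

struct jail_name_table {
	struct jail_name_entry *buckets[JAIL_NAME_BUCKETS];
};

/* 32-bit FNV-1a hash of the name, reduced to a bucket index. */
unsigned int
jail_name_bucket(const char *name)
{
	uint32_t h = 2166136261u;

	while (*name != '\0') {
		h ^= (unsigned char)*name++;
		h *= 16777619u;
	}
	return (h % JAIL_NAME_BUCKETS);
}

/* Return 1 if the name is already registered, 0 otherwise. */
int
jail_name_exists(const struct jail_name_table *t, const char *name)
{
	const struct jail_name_entry *e;

	for (e = t->buckets[jail_name_bucket(name)]; e != NULL; e = e->next)
		if (strcmp(e->name, name) == 0)
			return (1);
	return (0);
}

/* Register a unique name: 0 on success, -1 on collision or allocation failure. */
int
jail_name_register(struct jail_name_table *t, const char *name)
{
	struct jail_name_entry *e;
	size_t len = strlen(name) + 1;
	unsigned int b = jail_name_bucket(name);

	if (jail_name_exists(t, name))
		return (-1);
	if ((e = malloc(sizeof(*e))) == NULL)
		return (-1);
	if ((e->name = malloc(len)) == NULL) {
		free(e);
		return (-1);
	}
	memcpy(e->name, name, len);
	e->next = t->buckets[b];
	t->buckets[b] = e;
	return (0);
}
```

With a zeroed table, registering "www" twice succeeds the first time and is rejected the second time; only one bucket chain is walked per check.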
2.2.2. Misc. jail issues
- get rid of sysctls
- manage sysctls per-jail
- raw sockets for one jail, not for another- etc...
Message buffer, be careful- it's quite large
- priv(9) in the kernel: fine-grained privileges make it possible to allow a mask of privs inside a jail
- Maximum set of privs. inside jail, then we can assign privs to jails.
- need to keep child jails in order
- Allow user to mount filesystems, etc...
- in linux, users get different capabilities to mount filesystems (get ucaps, etc...)
- remove setuid root from ping, etc... or setuid gid...
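The "maximum set of privs" idea, together with keeping child jails in order, might look something like the sketch below: each jail carries a bitmask, and a child jail can only ever receive a subset of what its parent holds. The privilege bits and all names here are hypothetical and do not correspond to the real priv(9) constants.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical privilege bits; not the real priv(9) constants. */
#define JPRIV_RAW_SOCKET	(1u << 0)
#define JPRIV_MOUNT		(1u << 1)
#define JPRIV_SET_HOSTNAME	(1u << 2)

struct jail_privs {
	const struct jail_privs *parent;	/* NULL for a top-level jail */
	uint32_t allowed;			/* maximum set of privs in this jail */
};

/*
 * Assign privileges to a jail. A child jail is silently limited to
 * what its parent holds, so nesting can never widen the mask.
 */
void
jail_privs_init(struct jail_privs *j, const struct jail_privs *parent,
    uint32_t requested)
{
	j->parent = parent;
	j->allowed = (parent != NULL) ? (requested & parent->allowed) : requested;
}

/* Return 1 if every bit in priv is allowed inside the jail, 0 otherwise. */
int
jail_priv_check(const struct jail_privs *j, uint32_t priv)
{
	return ((j->allowed & priv) == priv);
}
```

For example, a jail granted raw sockets but not mount can hand raw sockets to a child jail, while a child request for mount is dropped. This covers the "raw sockets for one jail, not for another" case without a global sysctl.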
2.3. Marko's Virtualized network stack work
http://imunes.tel.fer.hr/virtnet/
2.3.1. Project Status
- Reasonably stable
- conditionally compilable
- bunch of macros, which can revert back to head with some exceptions of special cases
- removal should not harm system
2.3.2. Implementation Discussion
- The socket knows where it lives; each thread holds a pointer to the instance that needs to be worked on.
- Different threads can operate on different instances
- performance relies on current thread macro, (this is cheap, in the end)
- The Per-CPU macro feels cheap,
- Pawel note, it's not as cheap as reading a pointer
- The good thing about having this implicit propagation of the context to operate on:
- every socket is attached to one instance, one socket to each instance at all times
- one can always deduce state of a given process
Pawel sidenote, never knew which IP to operate on before...
- options:
- The one case where the context cannot be known is the timers
- cannot operate without context
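The implicit context propagation discussed above can be illustrated with a userland sketch, assuming a per-thread "current instance" pointer that is set when a socket enters the stack and restored on the way out. The macro and structure names (`INST_SET`, `INST_RESTORE`, `net_instance`, `sock_sketch`) are made up for this example and are not taken from the patch; C11 `_Thread_local` stands in for the kernel's current-thread macro.

```c
#include <stddef.h>

struct net_instance {
	int id;
	unsigned long packets_in;
};

struct sock_sketch {
	struct net_instance *inst;	/* every socket is attached to one instance */
};

/* Per-thread pointer to the instance currently being worked on. */
static _Thread_local struct net_instance *cur_inst;

#define INST_SET(i) \
	struct net_instance *saved_inst_ = cur_inst; cur_inst = (i)
#define INST_RESTORE()	cur_inst = saved_inst_

/* Deep inside the stack: no instance argument, the context is implicit. */
void
count_packet(void)
{
	cur_inst->packets_in++;
}

/* Entry through a socket: the context is deduced from the socket itself. */
void
sock_input(struct sock_sketch *so)
{
	INST_SET(so->inst);
	count_packet();
	INST_RESTORE();
}

/*
 * Timers are the case where the context cannot be deduced, so the
 * instance has to be carried explicitly in the callback argument.
 */
void
timer_fire(void *arg)
{
	INST_SET((struct net_instance *)arg);
	count_packet();
	INST_RESTORE();
}
```

Two threads calling `sock_input()` on sockets from different instances each see their own `cur_inst`, which is the "different threads can operate on different instances" property.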
2.3.3. Not done
- cleanup of the state.
- killing one, is messy.
- Problematic issue:
- proto pr_init
- record the sequence, so to explicitly instantiate instance,
- replay an instance
- Will replay in reverse order be captured correctly... For init, it seems to work fine-
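The record-and-replay idea could be sketched as below, under the assumption (not stated in the notes) that each recorded init hook has a matching destroy hook. Init hooks are recorded in call order, replayed in the same order to instantiate a stack instance, and the destroy hooks are run in reverse order to tear it down. Whether reverse replay is actually sufficient for cleanup was an open question in the discussion; the hook names here are invented.

```c
#include <stddef.h>

#define MAX_HOOKS 16	/* hypothetical limit */

/* A hook appends one character to a trace so the call order is visible. */
typedef void (*hook_fn)(char *trace, size_t *len);

struct proto_hook {
	hook_fn init;
	hook_fn destroy;
};

static struct proto_hook recorded[MAX_HOOKS];
static size_t nrecorded;

/* Record an init/destroy pair in the order init was first called. */
int
record_hook(hook_fn init, hook_fn destroy)
{
	if (nrecorded == MAX_HOOKS)
		return (-1);
	recorded[nrecorded].init = init;
	recorded[nrecorded].destroy = destroy;
	nrecorded++;
	return (0);
}

/* Instantiate: replay the init hooks in recorded order. */
void
instance_create(char *trace, size_t *len)
{
	size_t i;

	for (i = 0; i < nrecorded; i++)
		recorded[i].init(trace, len);
}

/* Tear down: run the destroy hooks in reverse order. */
void
instance_destroy(char *trace, size_t *len)
{
	size_t i;

	for (i = nrecorded; i-- > 0; )
		recorded[i].destroy(trace, len);
}

/* Example hooks standing in for two protocols. */
void ip_up(char *t, size_t *n)    { t[(*n)++] = 'I'; t[*n] = '\0'; }
void ip_down(char *t, size_t *n)  { t[(*n)++] = 'i'; t[*n] = '\0'; }
void tcp_up(char *t, size_t *n)   { t[(*n)++] = 'T'; t[*n] = '\0'; }
void tcp_down(char *t, size_t *n) { t[(*n)++] = 't'; t[*n] = '\0'; }
```

Recording `ip_up` and then `tcp_up`, followed by one create and one destroy, brings IP up before TCP and takes TCP down before IP.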
2.3.4. Before any cleanup can commence:
- the stack must be free of processes sticking
- sockets, and interfaces
- it doesn't attempt to do anything.
- For network emulation, it's cool.
2.3.5. Q: What about vlans- vlan ID collisions or not?
- Retain the association with parent id, you won't know it's a vlan id
- Can create independent Vlan interface inside or outside
- Can assign physical interface to virtual stack
- TSO and fancy stuff working at full speed...
keep Jail code and virtual stack separate, as well as other resource constriction (general consensus this is an important idea)
Pawel sidenote: Then can we still call jail jail?
- Default should provide all-in-all package, should be intuitive
- processes which are not virtualized at all, but not network stacks...
2.4. Virtualize filesystems?
2.4.1. Big Change Order:
- 1. Network Stack Virtualization
- 2. Resource Control
- 3. Separate process trees per jail (each with its own init as pid 1)
Re. #2,
- Brooks: Summer of code project, not good- serious bug.
'Dummynet' with regard to Resource Control...? Talk to Jeff Roberson about it.
- Pawel: userland threading?
nice(1), renice(8) discussion..
- Group jailed systems, give them say 90% cpu, then each of them gets a smaller chunk of the cpu... Renice inside a jail will not let you upgrade your nice status (globally, on the host), only nice yourself in jail.
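As a rough illustration of the arithmetic in the last point, the sketch below assumes the group's CPU share is split evenly between jails (the notes only say each gets "a smaller chunk") and that the host assigns each jail a floor nice value that renice inside the jail cannot go below. The function names and the even split are assumptions made for the example.

```c
#define SKETCH_NICE_MAX 20	/* highest (least favourable) nice value */

/* Percentage of total CPU for one jail when group_pct is split evenly. */
double
jail_cpu_share(double group_pct, int njails)
{
	return (njails > 0 ? group_pct / njails : 0.0);
}

/*
 * Clamp a nice value requested inside a jail. The jail may make
 * itself nicer, but can never drop below the floor set on the host.
 */
int
jail_renice(int jail_floor, int requested)
{
	if (requested < jail_floor)
		requested = jail_floor;
	if (requested > SKETCH_NICE_MAX)
		requested = SKETCH_NICE_MAX;
	return (requested);
}
```

With 90% given to a group of three jails, each jail gets 30% here. A jail with floor 5 asking for nice -10 ends up at 5, while asking for 10 is granted as requested.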
3. Addendum, regarding Brooks' work on Process Scheduling
3.1. Notes from an informal conversation with Brian Redman, regarding Process Scheduling
- "The problem was that any process could wipe out the system by swapping. No unix that I know of has a priority scheme to prevent one process from impacting another - ergo hog." (hog source code was distributed with ike's jail(8) lecture materials from EuroBSDCon2007, and is pasted in below for convenience) "At Morgan HPC we used a system first described here: that was commercialized by Softways which became Aurema which now seems to have been acquired by Citrix."
3.2. HOG Source Code, a simple utility to hog memory:
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
  written by Brian Redman (BER), sometime around 1986

  Disclaimer

  THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESSED OR
  IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE AUTHOR OR HIS CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
  INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  POSSIBILITY OF SUCH DAMAGE.
*/

/*
  Basic Instructions

  Compile this code to a binary:

  # cc hog.c -o hog

  then run something like:

  # hog 10m

  - and the hog will do just that- sit and hog 10mb of ram.
  (The size takes a k, m or g suffix; with no suffix it is in bytes.)

  To run a hog stampede, (a fork bomb):

  # while (1)
  # hog 99m&
  # end

  # note: BER has used this code to break nearly every stock UNIX system that's
  # existed since 1986 or so, fork bombs are a complicated problem which nobody
  # has eloquently resolved in a dynamic manner, to date (2006).
  #
  # This code is good for replicating overt system memory leaks as well.
*/

volatile sig_atomic_t dosleep = 0;	/* toggled by SIGUSR1 */
long rlimit = 0;
struct rlimit rl;

/* SIGUSR2: print the RSS limits, halve the soft limit, print the result. */
void
catch2(int i)
{
	printf("rlim_cur was: %ld\n", (long)rl.rlim_cur);
	printf("rlim_max was: %ld\n", (long)rl.rlim_max);
	rl.rlim_cur /= 2;
	setrlimit(RLIMIT_RSS, &rl);
	getrlimit(RLIMIT_RSS, &rl);
	printf("rlim_cur is: %ld\n", (long)rl.rlim_cur);
	signal(SIGUSR2, catch2);
}

/* SIGUSR1: toggle between hogging and sleeping. */
void
catch1(int i)
{
	if (dosleep) {
		dosleep = 0;
	} else {
		dosleep = 1;
	}
	signal(SIGUSR1, catch1);
}

int
main(int argc, char *argv[])
{
	long *ip, *p;
	unsigned long i, n;
	long m = 1;
	size_t len;

	if (argc < 2 || (len = strlen(argv[1])) == 0) {
		fprintf(stderr, "usage: hog size[k|m|g]\n");
		return (1);
	}

	signal(SIGUSR1, catch1);
	signal(SIGUSR2, catch2);
	printf("%d\n", (int)getpid());

	/* Size suffix: each case deliberately falls through to the next. */
	switch (argv[1][len - 1]) {
	case 'g':
		m = 1024;
		/* FALLTHROUGH */
	case 'm':
		m = m * 1024;
		/* FALLTHROUGH */
	case 'k':
		m = m * 1024;
		argv[1][len - 1] = '\0';
	}
	n = m * strtoul(argv[1], (char **)NULL, 10);

	/* Raise the soft RSS limit to the requested size plus 2MB of slack. */
	getrlimit(RLIMIT_RSS, &rl);
	rl.rlim_cur = n + 2 * 1024 * 1024;
	setrlimit(RLIMIT_RSS, &rl);

	if ((p = (long *)malloc(n)) != NULL) {
		printf("malloced %lu bytes\n", n);
		/* Touch every word, forever, so the pages stay resident. */
		while (1) {
			while (dosleep) {
				sleep(10);
			}
			ip = p;
			for (i = 0; i < n / sizeof(long); i++) {
				*ip++ = i;
			}
		}
	} else {
		printf("failed to malloc %lu bytes\n", n);
	}
	return (0);
}