A Better MergeSort for FreeBSD

Summary

The current libc implementation for mergesort is fundamentally flawed in that it does not allow sorting data which has an element size less than sizeof(void *) / 2. I will develop a pointer width agnostic Mergesort remove this constraint which limits FreeBSD’s Mergesort on machines with larger pointer sizes. Moreover, I aim to improve the speed of Mergesort and ideally decrease the space requirements for its use.

The Project

Element width

The core problem of this GSoC project is dissecting the implementation in

https://github.com/freebsd/freebsd/blob/master/lib/libc/stdlib/merge.c

to determine the critical sections that depend on the size of the elements being greater than sizeof(void *) / 2 in order to discover the purpose of this restriction. This is likely a trade off in exchange for some optimization. I plan to use the Community Bonding period to do more extensive code reading and tests, as well as ask the opinion of more experienced contributors, in order ensure I’m prepared for the start of the coding period.

Efficiency

I aim to improve the practical runtime and space complexity of mergesort based on the papers:

which implement an in-place block Mergesort using O(1) additional memory. I will either port a working block Mergesort from https://github.com/Mrrl/GrailSort or https://github.com/BonzaiThePenguin/WikiSort and tinker with them to optimize and style the implementation for FreeBSD, or write my own implementation from the algorithm pseudocode laid out in the paper if the code cannot be ported. The challenge here for me is my lack of knowledge of open-source software licensing. I need to learn what software can be copied and what software must be implemented myself.

Deliverables

A pointer width agnostic Mergesort

I will either change FreeBSD’s mergesort to support any element size or replace it with a comparable or better mergesort that does.

A Mergesort benchmark suite

I will write a set of benchmark scripts which demonstrate the speed improvement the new Mergesort has on different data sets compared to the previous Mergesort. These will likely be bash or python scripts that call to gprof and pmcstat.

Timeline

Community Bonding

My plan during the community bonding period is to communicate with contributors over the mailing list and do a deep reading of the mergesort implementation in order to determine the purpose of the restriction on element width. Furthermore, I will use this time to learn BSD style and licensing rules, as well as look over my labs copy of The Design and Implementation of the FreeBSD Operating System.

Weeks 1-2 (May 30 - June 11)

Task: Write a battery of tests to support the correctness of my implementation.

Approach: Write C unit tests which use the mergesort function in as many different ways as I can conceive. Aim to have tests which emulate likely mergesort workloads, as well as test for corner cases. This will be done early such that the implementation that is written can be tested immediately.

Deliverables: test/mergesort_tests.c: A rigorous correctness testing suite

Week 3 (June 12 - June 18)

Task: Decide on an algorithm and demonstrate that it can be implemented as pointer width agnostic.

Approach: Use the information gathered over community bonding period in order to decide if the current implementation can be restructured to be pointer width agnostic. If not, compare the pros and cons of the two block sorting algorithms in the referenced papers in order to determine which is preferable.

Deliverables: A plan of action which notates the algorithm chosen as our mergesort replacement and why it is believed to be superior.

Plan of Action

To modify WikiSort over the coming weeks to fit the freebsd mergesort specifications. Including the following tasks:

Weeks 4-7 (June 19 - July 16)

Task: Implement the chosen algorithm.

Approach: See Plan of Action above.

Deliverables: lib/libc/stdlib/merge.c: A backwards compatible, pointer width agnostic, mergesort implementation.

Weeks 8-10 (July 17 - August 6)

Task: Write benchmarking scripts in order to support the time and space usage superiority of my implementation.

Approach: Write a C program which tests merge.c against different data sizes and data patterns for runtime. Use a hardware performance utility to profile the memory usage of each implementation as well.

Deliverables: test/mergesort_benchmark.c: A comprehensive benchmarking suite which exhibits the improved mergesort’s runtime and memory usage.

Weeks 11-13 (August 7 - 29)

Task: Documentation, style audit, and merge to production.

Approach: Latex a pdf (or whatever format of documentation is preferable) containing all the design decisions and optimizations that went into the new mergesort. Display the improvement that the new mergesort provides, with before and after data visualization. Edit the man page for mergesort, removing the restriction on element sizes. Look through merge.c and the C unit tests and ensure that they conform with the style guide in the FreeBSD man pages. Lastly, share the changes with the mailing list, requesting comments and concerns before requesting the files be merged into stdlib.

Deliverables: Mergesort.pdf

Background

Personal Information

Contact Information

The Code

https://github.com/milesfertel/GSOC-Mergesort https://github.com/milesfertel/freebsd

Results and Documentation

Summary

The final deliverables of this project are:

Changes

Wikisort

Testing Suite

Benchmarking

Instructions

Wikisort

Testing Suite

Benchmarking

  Usage:
    ./mergesort_bench [alg] [test] [runs] [elt_power]

  Valid algs:
    heap merge quick
    
  Valid tests:
    rand sort part rev
        rand: Random element array
        sort: Increasing order array
        part: Partially ordered array
        rev: Decreasing order array
        
  Run the algorithm [runs] times with 2^[elt_power] elements

        Usage: python bench.py [elts]

Files will be output in a new folder called stats with separate files for each statistical comparison and the raw results in a subfolder called data. Bench will optionally run 2 elts elements as the dataset size when provided. Otherwise it will run 2 20 elements.

Contribute

Wikisort

Testing Suite

Benchmarking

Support

If you are having issues, please let us know. We have a mailing list located at: freebsd-hackers@freebsd.org or you can email me personally at milesfertel@college.harvard.edu

License

The project is licensed under the BSD license and the Unlicense.

SummerOfCode2017/Mergesort (last edited 2017-08-28 13:21:37 by MilesFertel)