Multi-Dimensional Data Viewer

MDDViewer is an application for exploring multi-dimensional data using parallel scales that can be filtered and divided. A standalone installer is available for Mac OS. On other platforms you will need to have Java installed on your system to run it. The demonstration page provides additional videos of its key features.

What kind of data does it work on? It works on multi-dimensional data: a collection of data rows, where each row has multiple attributes. Each attribute defines an axis, which is divided into categories or numeric intervals. Hierarchies can be defined for an attribute, allowing a user to choose the level at which they want that attribute shown. Typical tools for viewing such data include Tableau and Microsoft’s PowerPivot. The key difference between MDDViewer and other such tools is that MDDViewer presents data in proportionate parallel views, rather than views based on orthogonal axes, providing direct support for mosaic-plot-like views.

No database background is needed to use this software. The data preparation page provides a complete description of the steps needed. However, for those with such a background, the approach taken is as follows: the preparation tool PrepData aggregates a table of point data facts into buckets (i.e. the cells of an OLAP base datacube) using supplied dimension descriptions. Each dimension may be a flat list of categories or intervals, or it may form a hierarchy. The fact table and dimension tables together conform to an OLAP star schema. Measures may also be included. If no measures are included, counts (i.e. frequencies) are used.
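The aggregation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PrepData's actual code or input format: the fact rows, the `age_bucket` helper and the `dimensions` mapping are all hypothetical, and the values are made up.

```python
from collections import Counter

# Hypothetical fact rows: one dict per data point (illustrative values only).
facts = [
    {"sex": "female", "age": 29, "pclass": 1},
    {"sex": "male", "age": 41, "pclass": 3},
    {"sex": "male", "age": 8, "pclass": 3},
    {"sex": "female", "age": 35, "pclass": 1},
]


def age_bucket(age):
    """Hypothetical dimension description: map an age to a numeric interval."""
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    return "40+"


# Each dimension maps a fact row to one bucket label (a flat dimension here).
dimensions = {
    "sex": lambda row: row["sex"],
    "age": lambda row: age_bucket(row["age"]),
    "pclass": lambda row: row["pclass"],
}


def aggregate(rows, dims):
    """Count rows per cell, where a cell is one bucket label per dimension."""
    cube = Counter()
    for row in rows:
        cell = tuple(dims[name](row) for name in dims)
        cube[cell] += 1  # no measure supplied, so fall back to frequency
    return cube


cube = aggregate(facts, dimensions)
for cell, count in sorted(cube.items(), key=str):
    print(cell, count)
```

A hierarchy could be handled the same way by having a dimension return a tuple of labels, one per level, but that is beyond this sketch.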

Some queries can be answered more easily with the proportionate parallel views of MDDViewer, in particular queries that involve a comparison of proportions. For example, in a workers census dataset, when comparing male and female income distributions it would be better to see how the data looks after compensating for differences in occupation or other attributes. MDDViewer supports this with one operation that reweights the data and changes the view with a single animated transition.

Mosaic Plot Views: Titanic Data
The Titanic was carrying over 2,000 passengers and crew when it sank in 1912, with more than half of the passengers and crew perishing. The data used here is sourced from the Department of Biostatistics at Vanderbilt University School of Medicine’s website. It is a partial dataset containing 1309 rows of passenger data including name, sex, class, fare, embarkation port, family count, age and survived status. This data can be presented by MDDViewer as a multi-panel mosaic plot, where each panel shows categories whose width is proportional to the number of passengers in it. Each category is divided into green (survived) and blue (did not survive) by proportion. We can see that most passengers perished (the bottom panel), but most females survived. Survival improved with the cabin class and ticket fare. Survival also improved if there were 1, 2 or 3 accompanying family members, but went down for passengers with 4 or more family members.
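The geometry of one such panel is simple to compute. The sketch below is my own illustration, not MDDViewer's code: the function name `mosaic_panel` is hypothetical, and the counts are toy values rather than the real Titanic figures.

```python
def mosaic_panel(counts):
    """Compute panel geometry from {category: (survived, perished)} counts.

    Returns {category: (width_share, survived_share)}: the width is the
    category's share of all rows, and survived_share is the fraction of
    that category drawn in green.
    """
    total = sum(survived + perished for survived, perished in counts.values())
    panel = {}
    for category, (survived, perished) in counts.items():
        size = survived + perished
        panel[category] = (size / total, survived / size)
    return panel


# Toy counts for two categories (illustrative only, not the Titanic data).
toy_counts = {"A": (30, 10), "B": (20, 40)}
panel = mosaic_panel(toy_counts)
for category, (width, green) in panel.items():
    print(f"{category}: width={width:.2f}, survived share={green:.2f}")
```

The sketch assumes every category has at least one row; an empty category would need a guard against division by zero.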

When a panel contains only a few categories, it can be rotated to give a better view. It shows that, while most passengers were in 3rd class, the survival rate for 3rd class was less than half the survival rate for 1st class. This is a typical mosaic plot, showing both the size of each category and any over- or under-representation (of survival in this case).

The Titanic data can also be presented with a traditional bar chart view. Consider the cabin class panel. It shows that the number of survivors in 1st class and 3rd class was similar. But reading the extent of over- or under-representation at a glance is difficult, as the total size of each category varies. Simply focusing on the green bars would be misleading. The mosaic plot avoids this.

Another difficulty with the views provided so far is that there are two very distinct cohorts with different survival rates: female and male. It would be better to separate them. This analysis and the supporting data preparation continue on another page.

Compositional Adjustment: Controlling for a Covariate
A common problem when exploring multi-dimensional data is focussing on one attribute without taking into account variations in other attributes. For example, consider the comparison of male and female incomes in a workforce dataset. A simple comparison may show that the percentage of males with high incomes is much greater than the percentage of females with high incomes. However, this does not mean that males are paid more for the same work, as a greater proportion of females may be in lower paid occupations, and conversely a higher proportion of males may be in the higher paid occupations.

A statistical analysis that includes regression can help here. But the objective of MDDViewer is to answer such questions without using statistics, using only simple data transformations that will seem natural and so be understandable to most users. In this case, to answer the question "are females being paid less for the same work?", the data can be dynamically adjusted to compensate for male-female occupational differences, or for differences in any other attribute that might affect income, such as age. Support for this adjustment* is a key feature of MDDViewer.

I’ve created an example where there appears to be a difference in one attribute, but the apparent difference is fully accounted for by differences in another attribute. The example divides workforce data by gender (male and female) and by occupation (X and Y). A proportional mosaic-style plot has been used. Gender is the vertical axis, while occupation is the horizontal axis. The size of each rectangle is proportional to the number of workers in it. Occupations X and Y have the same number of workers, shown as the same area. One third of occupation X is male. One sixth of occupation Y is female. Occupation X has a salary of 50K, while occupation Y has a salary of 80K.

The average salary for females and males can be calculated as 56K for females and 71K for males. Female workers appear to be paid less than males. But from the above chart it can be seen that most female workers are employed in the lower paid X occupation. Given this skew in the data, a better question would be: if male and female workers had the same occupation distribution, what would the difference in income be? To answer this, the above chart needs to be rescaled to:
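The 56K and 71K averages can be checked directly from the proportions given for the example. The sketch below is just that arithmetic; the names `cells` and `average_salary` are mine, and each cell holds a share of the total workforce (each occupation being half of it).

```python
from fractions import Fraction as F

salary = {"X": 50, "Y": 80}  # salaries in thousands

# Share of the whole workforce in each (gender, occupation) cell.
# Occupations are equal-sized; X is 1/3 male, Y is 1/6 female.
cells = {
    ("female", "X"): F(1, 2) * F(2, 3),
    ("male", "X"): F(1, 2) * F(1, 3),
    ("female", "Y"): F(1, 2) * F(1, 6),
    ("male", "Y"): F(1, 2) * F(5, 6),
}


def average_salary(gender):
    """Weighted mean salary over the cells belonging to one gender."""
    weight = sum(w for (g, _), w in cells.items() if g == gender)
    total = sum(w * salary[occ] for (g, occ), w in cells.items() if g == gender)
    return total / weight


print(float(average_salary("female")))            # 56.0
print(round(float(average_salary("male")), 1))    # 71.4
```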

The total number of workers has been preserved. The total number of male (58%) and female (42%) workers has been preserved as well. However, females in occupation X have been reweighted (shrunk) by a factor of 0.63 (0.42/0.667) to reach 0.42, while males have been reweighted (boosted) by a factor of 1.74 (0.58/0.333) to reach 0.58. Males in occupation Y have been reweighted (shrunk) by a factor of 0.70 (0.58/0.833) to reach 0.58, while females have been reweighted (boosted) by a factor of 2.5 (0.42/0.167) to reach 0.42. This reweighting has brought the distribution of males and females in occupations X and Y into alignment: the male/female split in each occupation is now the same, 0.58/0.42.
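A minimal sketch of how these factors arise follows, assuming the proportions stated for the example (equal-sized occupations, X one third male, Y one sixth female). The variable names are my own. Because it works with exact fractions, it gives 0.625 and 1.75 for occupation X, where the figures quoted above were computed from rounded shares.

```python
from fractions import Fraction as F

# Gender split inside each occupation, and each occupation's share of workers.
share_in_occ = {
    "X": {"female": F(2, 3), "male": F(1, 3)},
    "Y": {"female": F(1, 6), "male": F(5, 6)},
}
occ_size = {"X": F(1, 2), "Y": F(1, 2)}

# Overall gender split across the whole workforce (the target for every occupation).
overall = {
    g: sum(occ_size[o] * share_in_occ[o][g] for o in occ_size)
    for g in ("female", "male")
}

# Reweighting factor per cell: target share divided by current share.
factors = {
    (o, g): overall[g] / share_in_occ[o][g] for o in occ_size for g in overall
}

for (occ, gender), factor in factors.items():
    print(occ, gender, float(factor))

# After reweighting, every occupation has the overall split and keeps its size.
for o in occ_size:
    adjusted = {g: share_in_occ[o][g] * factors[(o, g)] for g in overall}
    assert adjusted == overall
    assert sum(adjusted.values()) == 1
```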

The figure below shows how the reweighting is done. The example shows 18 workers: 6 females and 12 males. Occupation X is evenly divided into females and males, while in occupation Y females are under-represented: there are only 2 out of 10. The male occupation distribution (X to Y) is 1 to 2. To compositionally adjust, female workers are reweighted so their occupation distribution (X to Y) also becomes 1 to 2 whilst preserving the total number of female workers. The lower mosaic plot shows the result. Each occupation X female has been reweighted to half a worker, while each occupation Y female has been reweighted to two workers. Note that while this transformation brings the female and male occupation distributions into alignment, as evidenced by the resulting mosaic’s grid pattern, the size of each occupation slice is not preserved.
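The same reweighting can be written as a small function. This is a sketch of the idea rather than MDDViewer's implementation, and `align_to_reference` is a hypothetical name. The cell counts are the ones implied by the description: 6 females and 12 males, X evenly divided, and a male X to Y ratio of 1 to 2.

```python
# Worker counts per gender and occupation, as implied by the example.
counts = {
    "female": {"X": 4, "Y": 2},
    "male": {"X": 4, "Y": 8},
}


def align_to_reference(counts, group, reference):
    """Per-worker weights that give `group` the same category distribution
    as `reference`, while preserving the total size of `group`."""
    group_total = sum(counts[group].values())
    ref_total = sum(counts[reference].values())
    weights = {}
    for category, n in counts[group].items():
        target = group_total * counts[reference][category] / ref_total
        weights[category] = target / n
    return weights


weights = align_to_reference(counts, "female", "male")
adjusted = {cat: counts["female"][cat] * w for cat, w in weights.items()}

print(weights)   # {'X': 0.5, 'Y': 2.0}
print(adjusted)  # {'X': 2.0, 'Y': 4.0}

# Occupation slices change size: X shrinks from 8 to 6, Y grows from 10 to 12.
slices = {cat: adjusted[cat] + counts["male"][cat] for cat in adjusted}
print(slices)    # {'X': 6.0, 'Y': 12.0}
```

The sketch assumes the group has at least one worker in every category; a category with none cannot be reweighted this way.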

After bringing the female occupation distribution into alignment with the male distribution via reweighting, average income can be calculated again. The average income is now 65K for both females and males. The earlier apparent difference in incomes is fully accounted for by occupational differences. Clearly, for real data the situation will be somewhat different, not only because of a difference in pay between the two genders, but because there will also be differences across other attributes such as education level, age and so on.
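As a check on the 65K figure, the adjusted average is just a weighted mean in which each cell's size is multiplied by its reweighting factor. The sketch below assumes the salary example above (equal-sized occupations, X one third male, Y one sixth female, salaries of 50K and 80K) with exact factors; the names are mine.

```python
from fractions import Fraction as F

salary = {"X": 50, "Y": 80}  # salaries in thousands

# Original cell sizes (share of workforce) and exact reweighting factors.
cells = {
    "female": {"X": F(1, 3), "Y": F(1, 12)},
    "male": {"X": F(1, 6), "Y": F(5, 12)},
}
factor = {
    "female": {"X": F(5, 8), "Y": F(5, 2)},
    "male": {"X": F(7, 4), "Y": F(7, 10)},
}


def adjusted_average(gender):
    """Mean salary for one gender after applying the reweighting factors."""
    weights = {occ: cells[gender][occ] * factor[gender][occ] for occ in salary}
    return sum(weights[occ] * salary[occ] for occ in salary) / sum(weights.values())


print(adjusted_average("female"), adjusted_average("male"))  # 65 65
```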

A Bigger Example
The above example shows the importance of adjusting data for the effect of variation in attributes such as occupation, education and age before making the income comparison. The following video shows MDDViewer doing this analysis on a US workforce (1994) dataset from the UCI Machine Learning Repository (Adult Data Set).

To see a demonstration of MDDViewer supporting an analysis of another dataset, visit the exploring a dataset page.

PrepData
A data visualisation tool is not much use without a supporting data preparation capability. MDDViewer includes such a capability. To try MDDViewer on sample data or on your own data visit the download page.

Acknowledgement*

The key idea of reweighting the data to create a counterfactual that removes the effect of variation in one or more other attributes when looking at a given attribute is called compositional adjustment. I learnt of this idea when reading chapter 7 of Mark Handcock and Martina Morris’s book “Relative Distribution Methods in the Social Sciences”, published by Springer, 1999. A summary also appears in their journal paper “Relative Distribution Methods” in Sociological Methodology, volume 28, no. 1, pages 53-97, 1998.