Zombies are now a common topic of discussion. In fact, the data we have available from Google Trends (for the phrase "zombie attack") strongly suggest an increasing risk of zombification across the world:
However, academic research on zombies is limited (i.e., nonexistent), mainly because of the lack of high-quality data. I refer those interested in studying zombies to Andrew Gelman's paper (cowritten, apparently, by the great zombie film director George Romero) on how to measure zombie outbreaks via indirect survey techniques. You can find his article here. Even if you're not interested in zombies, his paper offers some good ideas on how to sample difficult-to-reach populations more generally.
Thursday, May 31, 2012
Friday, May 11, 2012
The Promising Future of Mathematical Sociology
I'm now an occasional blogger at Permutations, the official blog of the Mathematical Sociology Section of the American Sociological Association. You can read my blog post here, in which I outline why I think global trends in information technology and the metatheoretical foundations of sociology provide conditions for a promising future for sociology in general and mathematical sociology in particular.
Thursday, May 10, 2012
90+ Two-Minute Videos on R
I highly recommend Anthony Damico's excellent two-minute videos on programming in R. You can find the full list of 90+ videos here. This is the first of the series, which tells you how to download and install R:
More generally, Anthony's video collection is another reminder of the immense sociological benefits that come from sharing educational materials and expert knowledge in the style of the Khan Academy.
Tuesday, May 08, 2012
Global Online Conference on Statistics
The Consortium for the Advancement of Undergraduate Statistics Education is hosting a global online conference titled "eCOTS: Electronic Conference on Teaching Statistics." You can view the full program here. It only costs $15 to register and participate in the online conference. For at least the past five years I've thought that conferences are obsolete in many respects, so I'm delighted to see this conference developed. By not having a physical place, with food, beverages, and equipment, not to mention lodging and transportation costs, the costs of attendance are much lower, thus enabling more and more people to learn and contribute to knowledge production. (Of course, we'll still want some conferences for face-to-face socialization!)
Sunday, May 06, 2012
I've Converted to R Full-Time
It's been over four years that I've been using both R and Stata, but as of last week I've become an R convert. For several years I had conducted statistical analyses in R (since many complex models can only be programmed in R), but I used Stata before and after the analyses. In essence I'd merge and clean data sets in Stata, call R from Stata for the statistical analyses, export R objects into Stata, and then use Stata's graphics utilities to display the results. This setup quickly unraveled last month when I began merging and recoding data in R, which is much aided by John Fox's fantastic "car" package.
The problem is that if you want to do Bayesian analysis or graph modeled coefficients (or work with complex data structures more generally), then R is much easier than Stata due to the object-oriented programming environment. It's unbelievably liberating to be able to save vectors, matrices, data frames, and so on from multiple data sources and manipulations in the same conceptual space. Additionally, R has fantastic graphics capabilities (3D plots, rotating hyperplanes, social network graphs, and so on), offers excellent tools for analyzing and displaying so-called big data (for example, check out the "tabplot" command from Google), and is (frankly) a fun, intuitive programming language. If you need additional reasons to be an R convert, keep in mind that R is completely free, open-source, and extensible, with over 5,300 statistical packages (as of April 2012).
Friday, May 04, 2012
Complex Sociotechnical Systems
In a fascinating, informative talk, the interim director of the Engineering Systems Division at MIT makes the case for a new field of study on complex sociotechnical systems. I ask a question near the end of the video, pointing out that the core concepts of the proposed new field are in fact those endemic to sociology: mixed methods, open systems, social change, and so forth. You can watch the full video here.
Saturday, April 21, 2012
Feynman on Curiosity
This video is one of the most effective advertisements I've seen for the value of gathering and systematizing empirical knowledge, by none other than the late Richard Feynman:
Also, since you are probably wondering: the music is Primavera by Ludovico Einaudi.
Tuesday, April 17, 2012
The Future of the Academy in 2032
For a few years before he died, I helped the great sociologist Dan Bell use his computer, and as a result I got to know him very well. One thing I learned from him (besides the distinction between "criticism" and "critique") is the usefulness of prediction as an endeavor in itself (as opposed to explanation). In this spirit, I offer five predictions about the future of the academy in 2032:
 First, despite opposition from many established institutions, there will be an enormous increase in open-source education. Classes on any topic will be available online for free, with lecture notes, videos, presentations, and chat services (with other students) available to anyone with a computer. Exemplars of this trend include MIT OpenCourseWare, Khan Academy, and videolectures.net.
 Second, academic publishing will be increasingly online, with peer review a continuous process. Rather than books and articles published at one time in paper form after a process of peer review, academic projects will be ongoing, process-oriented, available online, and subjected to a continual process of peer review. In essence, everything that academics produce will be a work in progress, updated when errors are noted. Early indications of this trend include the NBER archive and arxiv.org.
 Third, due to technological changes and increased monitoring of people's activity, academics will have to be adept at managing and analyzing big data. Common statistical methods will often be difficult to use on such large data sets, straining the computational capacities of computers. While not common in the academy yet, big data is one of the top buzzwords of 2012, and I expect this to spread to academic work relatively soon. An exemplar of this kind of academic work is the Google ngrams project. (One danger, however, is that private corporations might be hostile to information-sharing, and the values of profit-making may severely inhibit the availability of big data to academics.)
 Fourth, big ideas will actually be in greater demand in the future. Precisely because there will increasingly be an excess of information, grand theories and master narratives will be increasingly desired to help guide attention, avoid fragmentation of different research traditions, and unify otherwise disparate theories. For example, Josh Tenenbaum's effort to unify artificial intelligence (which suffers from disciplinary fragmentation) with probabilistic graphical models is a promising endeavor.
 Finally, the skills in demand will be increasingly modular rather than topical. For example, as part of the Cold War in the 1960s, the United States government funded various "area studies" programs to educate Americans on the traditions, customs, and practices of various geographic regions around the world. In the future, there will be less emphasis on this kind of topical knowledge, and greater emphasis on modular skills such as critical analysis of any kind of texts or arguments, understanding the basic structures of any set of languages, and gathering and analyzing various kinds of qualitative and quantitative data.
Wednesday, April 11, 2012
The Quantified Self
This site on the quantified self shows a small but growing revolution: using quantitative data for self-improvement. I can only expect this to grow in importance. Despite their popularity, means, modes, and medians (in their conditional variants as well) simply capture central tendencies, and there is nearly always substantial heterogeneity within and across populations. Accordingly, basic proscriptions and prescriptions, such as "Take an aspirin a day," may not apply to all individuals, and thus individual tracking is potentially extremely useful. For example, see Seth Roberts's blog post on how eating butter might improve cognitive functioning (for him, at the very least).
Misc. Lectures Online
I highly recommend the following lectures for anyone interested in social science research using quantitative methods:
 The late Sam Roweis (a brilliant educator who died unexpectedly several years ago) gives a superb introduction to machine learning and probabilistic graphical models here, complete with lecture slides. In case you aren't aware, probabilistic graphical models are in effect a unifying approach to a wide range of statistical models, from hidden Markov models to hierarchical Bayesian models.
 Salman Khan, the MIT graduate who started the eponymous Khan Academy, offers a superb series of lectures on probability, available here. Probability is actually the foundation for quantitative research in the social sciences, since much of the goal of inference is to quantify uncertainty through the use of probability distributions such as the Gaussian, Poisson, Gamma, and so forth.
 Although aimed at Python programmers, the computer scientist Allen Downey's lecture gives a thorough, intuitive, and entertaining overview of Bayesian analysis, which you can view in its entirety here.
Tuesday, April 10, 2012
Biplots in Stata
I've been examining qualitative data using biplots, which are readily available in Stata using Ulrich Kohler's excellent package. For example, here is a biplot of a rich data set of poor white men on variables such as drug use and other risk factors:
There are several useful features of biplots: first, they concisely summarize a wealth of information in one graph, including relationships among both cases and variables; second, in line with Tufte's dictum, biplots have a high data-to-ink ratio; third, since cases are not directly modeled, biplots help with integrating qualitative and quantitative data (i.e., cases are not "hidden" by a hyperplane, as in a classical linear regression model); finally, there are absolutely no frequentist statistics to deceive the analyst.
Wednesday, April 04, 2012
Top 5 Unsolved Sociological Questions
Physicists and other natural scientists often spend time specifying and focusing attention on unsolved questions, such as how particles obtain mass, the origins of dark matter, and how time is related to entropy. In general, I think it's a good practice for any field of endeavor, including sociology, to revisit the questions that are stubbornly and perplexingly unsolved. Thus, in this spirit of refining our ignorance (and clarifying our sociological "known unknowns"), here is my list of the top unsolved sociological questions of the early 21st century:
 What is causing the unprecedented, nearly monotonic drop in crime rates across the developed world over the last several decades? As the NYT mentions, this question has been perplexing criminologists and sociologists, and everything from changing demographics to the legalization of abortion has been cited (although the latter cause is most probably incorrect, pace Steven Levitt).
 Why have various forms of inequality been increasing across the developed world, from Sweden to the United States, since the early 1970s? Although many sociologists and economists have focused on technological change, immigration rates, and deunionization, deeper causes (such as those related to political institutions or social structures) remain largely unexplored.
 Why do so many cultural and social phenomena (such as the frequency of words in the English language, size of cities across the globe, and amount of wealth across individuals) follow power-law distributions when plotted by size (or frequency) and rank? Explanations have focused on preferential attachment (popularly articulated by Herbert Simon) and information efficiency costs (as outlined by Benoit Mandelbrot), but thus far we have no conclusive evidence for favoring any particular mechanism over others.
 How does culture (defined as values, norms, attitudes, and beliefs) result in different economic and political outcomes across groups? Since the time of Max Weber, the causal effect of culture on human behavior has baffled sociologists and other social scientists, in part because of the apparent intractability of measuring culture and clearly linking it to economic and political outcomes. As a result, answering this question is an open, fertile area of empirical and theoretical exploration.
 Why is the United States unusually politically conservative and religious compared to other developed countries? At least since Tocqueville, sociologists, including the late Seymour Martin Lipset, have puzzled over why the United States has exhibited a kind of cultural "exceptionalism" (in the non-normative sense), with relatively high levels of religiosity and political conservatism. Although many explanations have been offered, a satisfactory account has remained stubbornly elusive.
Tuesday, April 03, 2012
Making Books
This video makes me wonder how, although technology has innumerable benefits, some aspects of culture will be lost if we don't retain at least some working knowledge of older technologies:
Monday, April 02, 2012
The Limits of Formal Theory in Sociology
Sociologists and economists often disagree about the role of so-called "formal" theory in understanding social behavior. For the most part, sociologists are much more skeptical that mathematical models (with little reference to data) can clearly and accurately describe, explain, and predict how humans act, think, and feel. I take a middle-of-the-road position: such models of human behavior can be helpful for illuminating arguments, but often they are such crude approximations of reality that they can obscure what is actually going on. I'm reminded of Max Tegmark's brilliant article on the mathematical universe hypothesis, in which he claims that the universe is a giant mathematical structure. In fact, the disciplines can be understood in reference to derivations from known mathematical laws, as shown in this diagram:
The problem, as Tegmark suggests in this diagram, is that until we understand how to reconcile mathematically general relativity and quantum field theory, as well as how this reconciled theory is related to other fields in physics and related fields, mathematizing sociology will at best be a set of (possibly crude) approximations of reality.
Friday, March 30, 2012
Physics Envy
The NYT published an op-ed today by a pair of political scientists on "physics envy" by sociologists, economists, and political scientists. The authors mainly argue that theory can be useful even when it is wrong or unsupported by data, and briefly mention that data analysis is useful even if theoretical contributions are not obvious. I disagree with the former, but not the latter. For a similar view, see this post by the theoretical physicist Sean Carroll.
Thursday, March 29, 2012
Irving Louis Horowitz
The eminent political sociologist died a few days ago, according to an obit in the NYT. Long ago I read, and took seriously, his book The Decomposition of Sociology, in which he argues (essentially) for more empirical analysis and less left-wing politics in sociology. Reflecting on his book, I think he neglects a fundamental possible cultural contradiction: to the extent social reality exhibits facts consistent with liberalism and inconsistent with conservatism, empirical analysis will result in more liberal than conservative belief systems (but not values, since those cannot be proven "right" or "wrong" by scientific analysis). For example, evidence is accumulating that economic inequality (which is of little concern to most conservatives in the United States) has numerous deleterious effects, thus forcing conservatives either to hold beliefs inconsistent with the evidence (i.e., inequality is unrelated to deleterious effects) or alter their values (i.e., it is a "good" thing to have high rates of violence, low social mobility, and so forth).
Wednesday, March 28, 2012
MyPersonality
I highly recommend this website for learning about your attitudes, values, beliefs, and overall personality.
Sunday, March 25, 2012
Why are Economists so (Consistently) Led Astray About Inequality?
In the Boston Globe, Ed Glaeser, a conservative urban economist at Harvard, recently wrote an article titled Why income disparity in Boston isn't a bad thing. Glaeser is right that inequality increases in a city such as Boston can be due to selection effects, since poor people are moving into Boston for economic and cultural opportunities. Yet these selection effects (i.e., poor people moving into a geographic area in the hopes of upward mobility, which is generally considered a good thing) are drastically different from the observed outcomes (i.e., large disparities in people's wealth due to their social positions in a system of occupations, which is generally considered a bad thing). Glaeser conflates the two, confusing the reader and, perhaps, himself. A more accurate title for the article would have been "Why poor people moving into Boston isn't a bad thing." This raises a question: why are economists so (consistently) led astray about the causes and consequences of economic, social, and political inequality?
Popularity of Programming Languages
As you can see, R is relatively popular (but more so on StackOverflow than GitHub):
For the original graph, click here. This scatter plot is a reminder that R is useful to learn not only for statistical modeling (since there are so many excellent packages available), but also as a way to become familiar with programming more generally.
Saturday, March 24, 2012
Big Science and Sociology
I highly recommend this video featuring Dirk Helbing, a sociologist and erstwhile physicist who is (along with others) attempting to create a CERN-like society-simulating project for the social sciences by combining information from large data sets with simulated models of complex social systems:
Thursday, March 22, 2012
Statistical Lexicon
Anyone doing statistical analysis (or contemplating it) should read Andy Gelman's informative, humorous, and dead-on correct post on statistical lexicon.
McKinsey on Big Data
McKinsey has a full report (from March 2011) describing the meaning and potential impact of so-called big data. You can read the report here. One problem, which the authors of the report do not discuss in detail, is that since so much of what constitutes big data will be collected by private firms, there is a risk of restricted information pockets. In other words, only certain private actors will have access to big data, and academics might very well be left with very few big data sources.
Wednesday, March 21, 2012
Inequality: Everyone's Thinking About It
I ran into the following articles on inequality, which has been increasing not only structurally but also culturally (in that more policy elites and journalists are discussing the topic openly). Here are some recent posts on inequality:
 Reuters is reporting findings from a group of researchers showing that Sweden has undergone an enormous increase in inequality, especially since the rise of the center-right in the political system. For those of us in the United States who look to Sweden as a model of development, the lesson is that in recent years even this country has regressed from the ideals of social democracy.
 Based on an online survey (with all the caveats about sampling procedures, of course), a group has surveyed wealthy Americans on their views on inequality. The biggest finding, which reinforces the importance of class-based analyses of electoral politics: among the wealthy there is a huge gap between self-identified Republicans and Democrats, with over 84% of the latter favoring policies taxing the rich compared with around 29% of the former.
Universal Limits in High-Dimensional Statistics
The MIT Center on Operations Research is hosting a talk tomorrow on universal limits in high-dimensional statistics. The basic idea is that, for all fields of empirical study from sociology to high-energy physics, some criterion for "statistical significance" is crucial for making decisions based on the data. (The current hunt for the Higgs boson is in fact based on a modified criterion for statistical significance.) The problem, however, is that we are entering a world of big data, in which data structures have many dimensions, thus altering the potential usefulness of such criteria for statistical significance.
Sunday, March 18, 2012
Rethinking Tragedy and Success
The social theorist Alain de Botton presents a creative rethinking of the meaning of tragedy and success in a TED talk, shown here:
In essence, he argues that success needs to be rethought using insights from sociology, including an understanding of the limits of the ideal of a meritocratic society (since there is always random chance involved in social mobility), a deeper awareness of how failure as a concept involves particular beliefs and values (so that we can conclude that Hamlet is not a "loser" even though he "lost"), and a sensitivity to the fact that even when particular social and cultural distinctions appear to be irrelevant, economic differences certainly are not (so that comparing oneself to Bill Gates rather than the Queen of England is just as absurd, even though the former wears "business casual").
Saturday, March 17, 2012
Why Inequality Matters
The conservative magazine Commentary has published an article on how social inequality is on the political agenda and on the minds of most Americans, even though many conservatives would prefer the case to be otherwise. The authors argue that, in part, the discussion of inequality should be oriented toward social mobility and poverty, as well as the "injustices" of government policy. What the authors apparently fail to realize is the possibility that inequality causes poverty and immobility, not to mention "unjust" government policies perpetuating inequality. In particular, higher inequality can cause low social mobility by increasing socioeconomic distances between the highest and lowest rungs of society, higher rates of poverty by segregating groups and distorting resource allocations, and inequality-perpetuating government policies by shifting costs from the wealthy to the general population (through, for example, cutting funds for widely available public services and increasing take-home profits from private organizations).
Friday, March 16, 2012
Inequality "Crisis" of Marriage
The Atlantic Monthly posted a fascinating article today on the inequality "crisis" of marriage. My favorite line in the article: "Gone are the days when the Harvard grad marries the girl with the high school degree simply because, well, she's pretty."
Thursday, March 15, 2012
Corporate Culture Revisited
Greg Smith has a popular post in the NYT titled Why I am Leaving Goldman Sachs. His reason is that the organizational culture is now "as toxic and destructive" as he has "ever seen it." In particular, Smith charges that the values and norms of the organization are oriented almost exclusively toward profit-making, with little or no regard for the well-being of other organizations and people, including its clients.
Wednesday, March 14, 2012
Misc. Links
 MIT students are having a Pi Day recitation and celebration today (since today is 3.14, of course).
 The Financial Times discusses Goldman Sachs' corporate culture without, unfortunately, describing what is meant by the phrase; however, I'm glad to see that cultural factors are mentioned, since clearly faulty beliefs, norms, and values contributed to the financial crisis.
 The U.S. Census Bureau recently released a report describing the inequality levels (expressed as Gini coefficients) of all counties in the United States from 2006 to 2010; the findings show, as one would expect, that more populous counties are more unequal.
 Finally, a new study suggests that first-generation immigrants face a disadvantage in attending college due to a "cultural mismatch" in values and norms between working-class youth and those from middle- and upper-class backgrounds.
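For readers unfamiliar with the Gini coefficients mentioned in the Census item above, here is a minimal Python sketch of how one can be computed from a list of incomes. The function name `gini` and the toy inputs are my own illustrations, not anything taken from the Census report, which may use a different estimator (for example, one based on grouped or weighted data).

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted formula on sorted data:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical toy "counties": one perfectly equal, one where a single
# household holds all the income.
equal_county = [1, 1, 1, 1]
unequal_county = [0, 0, 0, 10]
print(gini(equal_county), gini(unequal_county))
```

With this formula, a finite population in which one unit holds everything has a Gini of (n - 1) / n rather than exactly 1, which is worth keeping in mind when comparing small and large units.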
Tuesday, March 13, 2012
MIT Inequality Talk
Scatter Plot Matrix in R
Stata has a large number of graphics capabilities (and I highly recommend Stata over other statistical packages for a variety of reasons), but in a few instances R is more useful. In particular, I find R useful for creating beautiful scatter plot matrices and 3D graphical displays. To my knowledge, currently these kinds of graphics are very difficult (if not impossible) to create in Stata 12. What I like about scatter plot matrices is that they can have a high data-to-ink ratio, packing together fitted lines, scattered data, histograms, correlations (proportional to the size of the correlation), and statistical significance "stars" (since reviewers seem to like them). Moreover, I like that all the information effectively puts the "stars" associated with statistical significance in appropriate context: there is an incredible amount of variability in the size of correlations and distribution of data among all the "three-star" correlations, underscoring the limited usefulness of statistical significance as a tool for understanding the social reality given to us by data.
Monday, March 12, 2012
Taxes and Inequality
The economist Daron Acemoglu and his colleague James Robinson have an excellent article on the problems with inequality in the United States. You can find it here. In general, I agree with them entirely, and they are persuasive in outlining the negative aspects of political inequality.
Sunday, March 11, 2012
3D Scatter Plots Redux
One weakness of Stata versus R is the lack of 3D graphing capabilities, in particular 3D scatter plots. However, with some modifications, Stata can indeed provide a suitable substitute for R in most graphical problems, as shown here (I use the infamous auto data set available in Stata with the sysuse command). The main weakness is that the xy and yz planes do not have grid lines; nevertheless, this graph is another indication that Stata's graphing capabilities are much stronger than many R users (and perhaps even Stata users) realize. Here's the graph:
Saturday, March 10, 2012
Checking Weather in Stata
I added a useful Stata command to my computer today: Neal Caren's weathr command in Stata (note that there is no "e"). The command is great: now you can check your day's weather entirely within Stata! The command obtains the current weather conditions and forecast for the next 36 hours from yahoo.com for any zip code in the United States.
Friday, March 09, 2012
Is Everything Culture?
In my readings on culture, I've found a fascinating set of theories called digital physics. These theories posit that the universe fundamentally consists of information (i.e., the "it from bit" doctrine that every particle, atom, quark, and so on is describable as a dichotomous "yes or no" categorization), and thus that the universe is in principle computable. Opponents of digital physics claim that reality is continuous, but the rejoinder is that reality only appears continuous, and is fundamentally categorical (for example, the Planck length suggests that reality is quantized). More relevant to sociology, these perspectives suggest that everything is culture (i.e., information) and thus that societies can be usefully modeled as information systems.
Thursday, March 08, 2012
Ternary (or Triaxial) Plots
One rarely used graphic is the ternary (or triaxial) plot, which is a very useful way of examining a tripartite decomposition of a variable. For example, the graph in this post displays the composition (which I constructed in Stata using Nicholas J. Cox's commands) of an economy over time. Note that the three percentages add to 100 (or, equivalently, the three proportions add to 1).
It's a bit surprising that this graph appears so infrequently; it would appear to be especially useful for political scientists showing voting fractions over time (with the three most prominent parties for each axis), economists examining the composition of an economy (such as above), or sociologists examining over-time trends in any three-part categorical variable (such as "agree," "disagree," or "neutral" on a question of values or attitudes).
However, note that simply because a graph looks like it's a ternary plot does not make it one! For example, Junk Charts dissects this pseudo-ternary plot in the New York Times.
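To make the geometry concrete: because the three shares sum to 1, each observation can be placed inside an equilateral triangle as a weighted average of the three corners. The Python sketch below shows one common convention for that mapping. The function name `ternary_to_xy` and the choice of corner positions are my assumptions for illustration; this is not how Nicholas J. Cox's Stata commands are implemented.

```python
import math


def ternary_to_xy(a, b, c):
    """Map three non-negative shares to (x, y) inside an equilateral triangle.

    Corners (an assumed convention): a -> (0, 0), b -> (1, 0),
    c -> (0.5, sqrt(3)/2). Shares are rescaled to sum to 1 first.
    """
    if min(a, b, c) < 0:
        raise ValueError("shares must be non-negative")
    total = a + b + c
    if total <= 0:
        raise ValueError("at least one share must be positive")
    a, b, c = a / total, b / total, c / total
    # The point is the share-weighted average of the three corners.
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2) * c
    return x, y


# Hypothetical economy: 20% agriculture, 30% industry, 50% services.
print(ternary_to_xy(20, 30, 50))
```

A chart that does not enforce the sum-to-one constraint cannot be drawn this way, which is one quick check for whether a triangular chart is a true ternary plot.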
Wednesday, March 07, 2012
Causality and Ethnography
The University of Chicago is hosting a conference on causality and ethnography on March 8th and 9th. Full details are available here. My own view on the relationship between causality and ethnography is that ethnographers should use counterfactuals, and in fact usually do whether or not they are explicit about them. In modern statistics (in particular, the work of Donald Rubin at Harvard, among others, on the potential outcomes model), the counterfactual model of causality clarifies the conditions under which any particular data set can be interpreted as causal, and shows that these assumptions are extremely strong. Contra the prevailing view of many economists, even instrumental variables regression, regression discontinuity design, and related methods require exceptionally (and often implausibly) strong assumptions for causal interpretation.
Tuesday, March 06, 2012
The Mystery of Power-Law Distributions
One criticism of sociology, and the macro social sciences more generally (such as political science, anthropology, and economics), is that there are very few "laws" of social reality. There are, however, some sociological regularities that are as yet not fully explained, and which seem bizarre. The most enduring and puzzling of these are power-law distributions (a well-known special case of this is "Zipf's Law"), which describe the fact that "large" instances of things are extremely rare, while "small" occurrences of things are extremely common (where size can refer to frequency in a population, population size, geographic space, and so on). In practice this means that a handful of words are much more frequent than other words (and most words are rarely used), wealth is concentrated in a small number of people (and most people are poor), there are a handful of really popular songs (and a vast number of unpopular tunes), and so on. Even the sizes of sand particles on a beach follow a power-law distribution: how often have you seen a boulder on a beach?
What might explain the ubiquity of powerlaw distributions? As far as I can tell, nobody is entirely sure, although we have some good guesses. For example, the sociologist Herbert Simon outlined a theory of preferential growth attachment (also known as the "rich get richer" effect), in which songs that are already fairly popular will become more popular, cities that are already large will become even larger, and words already used widely will become even more widely used. Note that this explanation hinges on a positive feedback effect: the probability that any thing gets "larger" is directly proportional to the current "largeness" of the thing; or, to put it another way, large values get amplified rather than cancelled out (as in a normal distribution).
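The "rich get richer" mechanism described above can be sketched in a few lines of Python. This is only an illustrative simulation, not Simon's own formulation: the function name, the number of steps, and the probability `p_new` of a brand-new item appearing are all assumptions chosen for the example. At each step either a new item appears, or an existing item grows with probability proportional to its current size.

```python
import random
from collections import Counter
from statistics import median


def simulate_rich_get_richer(steps=20000, p_new=0.05, seed=42):
    """Illustrative preferential-growth simulation (parameters are assumptions)."""
    rng = random.Random(seed)
    history = [0]   # one entry per past "use"; item 0 starts with a single use
    next_item = 1
    for _ in range(steps):
        if rng.random() < p_new:
            # A brand-new item enters with size 1.
            history.append(next_item)
            next_item += 1
        else:
            # Picking a random past use selects an item with probability
            # proportional to its current size (the positive feedback).
            history.append(rng.choice(history))
    return Counter(history)


counts = simulate_rich_get_richer()
sizes = sorted(counts.values(), reverse=True)
top10_share = sum(sizes[:10]) / sum(sizes)

print("number of items:", len(sizes))
print("largest item size:", sizes[0])
print("median item size:", median(sizes))
print("share of all uses held by the 10 largest items:", round(top10_share, 3))
```

In a run like this, a few items end up very large while the typical item stays tiny, which is the qualitative signature of a heavy-tailed distribution.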
Power-law distributions have important cultural, statistical, and political implications.
Culturally, there are several implications. First, most cultural constructs are rarely used and only a handful are common among any group of people. To put it another way, the shared part of culture is likely to be relatively small, while the particular part of culture is vast. Second, frequently used cultural constructs are particularly stable over time; that is, 500 years from now the word "the" will still be used, while "sesquipedalian" has a more uncertain future. Third, the stability of a cultural system is derived from the more frequently used cultural constructs, while the dynamism is among the less frequently used constructs. Fourth, initial conditions are extremely important for the frequency and hence durability of cultural constructs: for instance, small, random fluctuations led to the popularity of "the" in the English language. Finally, following from the previous point, the consequences of initial conditions are highly unpredictable; given small initial changes, English speakers today might be using the word "tha" or "se" instead of "the."
Statistically, the presence of power-law distributions is a reminder that classical linear regression (based on the normal distribution) is not always the appropriate fit to a scatter plot of two variables, and that summarizing a distribution as a mean or median can be highly misleading.
Politically, power-law distributions have a unique implication for efforts to deal with wealth inequality: one effective way to alter the distribution of wealth is to remove the positive feedback effects from wealth. The desired distribution of wealth would thus be described by a normal rather than power-law function. Importantly, removing the positive feedback effects of wealth would not lead to the removal of inequality, but rather a change in the distribution so that the mean, median, and mode are the same. From this perspective, policies should be in place so that (in principle) a person's change in wealth is independent of their current level of wealth. Such policies might include very high taxes on capital gains, restrictions on the influence of wealth in political decision-making, rules specifying equal monetary amounts from promotions for all occupational levels in a firm, and so on.
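To illustrate the statistical point above, here is a small sketch comparing the mean and median of a heavy-tailed sample. The data are synthetic draws from a Pareto distribution; the shape parameter (1.16) and sample size are assumptions picked for illustration, not estimates from any real wealth data.

```python
import random
from statistics import mean, median

rng = random.Random(0)

# Synthetic heavy-tailed "sizes" (assumed Pareto shape of 1.16, for illustration).
sample = [rng.paretovariate(1.16) for _ in range(100_000)]

sample_mean = mean(sample)
sample_median = median(sample)

# Share of the total held by the largest 1% of observations.
ranked = sorted(sample, reverse=True)
top_1_percent = ranked[: len(ranked) // 100]
top1_share = sum(top_1_percent) / sum(ranked)

print("mean:  ", round(sample_mean, 2))
print("median:", round(sample_median, 2))
print("share held by top 1%:", round(top1_share, 3))
```

The mean is pulled far above the median by a few enormous values, so neither number alone describes a "typical" observation.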
Monday, March 05, 2012
Visualizing a Correlation Table
Correlation tables are ubiquitous in social science research, but they are very rarely visualized. As I've emphasized in previous posts, I'm a strong advocate for visualizing data and models whenever possible. For example, for my research I graphed correlations using Adrian Mander's plotmatrix command in Stata. Using Mander's package, I could create a graph that clearly shows all the information in a parsimonious way; moreover, unlike a correlation table, correlation patterns are intuitively grasped from the shading of the cells, and there is an implicit emphasis on the correlation size rather than statistical significance.
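The idea of shading cells by correlation size can be sketched without Stata. The following is not Mander's plotmatrix, just a rough text-based analogue in Python: the variables `a` through `d` are synthetic, and the `pearson` and `shade` helpers are hypothetical names written for this example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def shade(r):
    """Map the size of a correlation (ignoring sign) to a darker block."""
    blocks = " ░▒▓█"
    return blocks[min(int(abs(r) * len(blocks)), len(blocks) - 1)]


# Synthetic variables: b tracks a, c runs opposite to a, d is unrelated noise.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(500)]
data = {
    "a": a,
    "b": [x + 0.5 * rng.gauss(0, 1) for x in a],
    "c": [-x + 0.5 * rng.gauss(0, 1) for x in a],
    "d": [rng.gauss(0, 1) for _ in range(500)],
}

names = list(data)
corr = {(i, j): pearson(data[i], data[j]) for i in names for j in names}

print("  " + " ".join(names))
for i in names:
    print(i + " " + " ".join(shade(corr[(i, j)]) for j in names))
```

Because the shading depends only on the absolute size of each correlation, strong relationships stand out at a glance regardless of sign or significance.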
Sunday, March 04, 2012
Why Models are Not Data
In doing research, sometimes it can be easy to think that the models one is using are in fact the data, but this is clearly not true. Even the mean of a sample of data is a model of the central tendency of the data, and not the data itself. One clear example of why models are not data is Anscombe's quartet. Consider the following:
What is remarkable about this quartet is that for all of these scatter plots the mean of x is the same (exactly), the variance of x is the same (exactly), the mean of y is the same (to two decimal places), the variance of y is the same (to three decimal places), the correlation between x and y is the same (to three decimal places), and the linear regression equation is the same (to two or three decimal places). In other words, the models of the data (e.g., mean, variance, correlation, etc.) are the same, but the data are not!
So what's the solution? As I've mentioned in previous posts, graphing the data is crucial, because we're forced to confront the actual data, and not models of the data.
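The summary statistics described above can be checked directly. The sketch below hard-codes the standard published values of Anscombe's quartet and computes each summary from scratch; the `summarize` helper is a name introduced just for this example.

```python
from statistics import mean, variance

# Anscombe's quartet (standard published values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the usual 'models' of a scatter plot: moments, correlation, OLS line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(xs),
        "mean_y": my,
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The four printed rows are nearly indistinguishable, even though plotting the four data sets shows very different patterns.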
Saturday, March 03, 2012
R versus Stata Redux
I've used both R and Stata for a long time, but these days I use Stata much more frequently than R. While R is useful for some kinds of graphics (especially three-dimensional graphics) and some statistical procedures (for example, finite mixture models), in general I prefer Stata as the go-to statistical program. The reasons are clear: Stata has superior help files for almost all ado files, Stata graphics are excellent (even contour plots are available in Stata), cleaning data is a breeze in Stata but awkward in R, labeling data is much more efficient in Stata (in fact, as far as I can tell R does not allow for labeling variable names, while Stata allows for labeling levels of a variable, the variable itself, and the data set), and for many procedures Stata's syntax is much more parsimonious than R's.
Yet, R is worth learning because the 3D graphics available are often extremely useful for exploring the data, and there will certainly be cases in which R will have statistical procedures that are unavailable or cumbersome in Stata (Bayesian analyses and finite mixture models come to mind, for example).
Friday, March 02, 2012
Culture and Poverty
The New York Times has an article covering the concept of the culture of poverty here. The article is fairly accurate, and does a good job highlighting that the study of culture and poverty had its origins among left-wing Marxists (although I would have mentioned Bowles and Gintis, who emphasized that cultural values and norms of obedience to capitalist ideologies rather than intelligence contribute to the social reproduction of inequality). The author elides the fact that the problem with the concept of the "culture of poverty" is that such a thing does not, and never has, existed: culture is everywhere, not just among a subset of the economically disadvantaged. The appropriate question, then, is: given that we know that culture is a constituent part of the human experience, how does it matter not just for poverty, but for happiness, well-being, inequality, wealth, and so on?
Thursday, March 01, 2012
Values and Politics
I'm a bit biased, but the front page of the Huffington Post highlighted a fascinating study on education, culture and politics today.
Wednesday, February 29, 2012
Reading the New York Times in Stata
One useful command for taking a break from research is Neal Caren's "nytimes" ado file. This command lists the most recent headlines with brief summaries from the New York Times. Best of all, no subscription is required!
Tuesday, February 28, 2012
Utility Theory as Naive Cultural Theory
Here's a fascinating presentation by the economist Steve Keen on utility theory and neoclassical economics. From the perspective of a cultural sociologist, what is of particular interest is that the utility theory underlying neoclassical economics has the appearance of a naive cultural theory. Specifically, the indifference curves that constitute supply and demand curves in neoclassical analysis are based on strong, disproved assumptions about how people value things in the world: first, completeness (i.e., that the individual knows their evaluative ranking of all combinations of things); second, transitivity (i.e., if thing A is valued over B, and B over C, then A is valued over C); third, non-satiation (i.e., more things are always valued over fewer); fourth, convexity (i.e., for each thing, additional value falls); fifth, structural independence from culture (i.e., what an individual values is independent of how much income they have); finally, no curse of dimensionality (i.e., information processing abilities are unlimited). No cultural theory in sociology has even approached the disbelief required for these kinds of assumptions. Fortunately, some sociologists (for example, Michael Hechter) have sought to correct this naive cultural theory, and have advocated eloquently and convincingly for a richer understanding of values in economic models of human behavior.
Monday, February 27, 2012
The Phil Gramm Effect
I recently reread Andrew Abbott's brilliant article on the problems with classical linear regression. One of the most persuasive criticisms is that statistical models are extremely difficult to use for examining small changes with big effects (but big changes with small effects can be modeled). I like to call this the "Phil Gramm Effect" because arguably one of the most important causes of the 2008 financial crisis (an undoubtedly big effect) was Phil Gramm (a small change), since he was the driving force for gutting the Glass-Steagall Act and shifting government regulations in favor of private companies (often called "deregulation," but more accurately termed "reregulation").
Sunday, February 26, 2012
Big Science in Sociology
The search for the Higgs Boson particle has captivated a wide range of people all over the world, and the construction of the Large Hadron Collider is the reason for this widespread interest. Is such a "big science" approach possible in the social sciences, including sociology? Although the details to me seem obscure, researchers in Europe have developed a proposal for what they call the FuturICT, a "big science" project for the social sciences (ICT stands for "Information and Communication Technology") in the mode of the Manhattan Project, Apollo Project, and Large Hadron Collider. But what is it, exactly, that they are proposing? I get the sense it's a giant computer simulation, but it doesn't seem entirely clear.
Saturday, February 25, 2012
Social Learning is Efficient
I encountered this clever article by several social scientists, including the cultural anthropologist Rob Boyd. Through various data, they show that it is beneficial to copy others (i.e., engage in social learning) rather than innovate by oneself. This highlights clearly the fiction of the "self-made" man, and the importance of one's cultural and social environment in leading to human flourishing.
Friday, February 24, 2012
3D Bar Graph "Masterpiece"
I encountered this post on how to turn a "boring" bar graph into a 3D "masterpiece." What's striking to me is that most of the people commenting actually want to replicate this graph, even though it violates the basic principles of effective statistical graphics, according to Tufte and others. For example, the 3D effect distorts the information displayed by the "boring" bar graph, making comparisons difficult, and the visualization effects distract from the underlying data as conveyed by the differing heights of the bars. Here's the "masterpiece" in its full glory:
Thursday, February 23, 2012
Violin Plots
Violin plots are an excellent way of displaying the distribution of a continuous variable by levels of a categorical variable. In essence, violin plots are box plots and kernel density plots combined. For instance, here is a set of violin plots from Stata's auto data:
These same data could also be displayed in tabular form, but again this is a case in which a graphical display is a more effective way to examine and convey the patterns in the data.
Wednesday, February 22, 2012
Big Data and the End of Theory?
An article in The Guardian gives appropriate caution to claims that data analysis (and only data analysis) is the solution for all or even most academic and research problems. As Max Weber observed in his brilliant essay on objectivity in the social sciences, even the process of data analysis depends on values that cannot be empirically proven as right or wrong: "The 'objectivity' of the social sciences depends [...] on the fact that the empirical data are always related to those value-ideas which alone make them worth knowing and the significance of the empirical data is derived from these value-ideas. But these data can never become the foundation for the empirically impossible proof of the validity of the value-ideas."
Saturday, February 11, 2012
Era of Big Data
The New York Times has a great article discussing the era of big data. This might have a Kurzweil-esque ring to it, but due to technological change big data is becoming increasingly available and ready for analysis: in fact, there are more data sets out there than brains to analyze them, especially when one notes the incredible number of combinations of analyses that could be conducted even on a single data set with 100 variables (in what is known as the curse of dimensionality). However, one problem with big data is that, since so much of the data are collected by private entities, much of it may not be available to academics and independent researchers.
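To give a rough sense of the number of possible analyses mentioned above, here is a back-of-the-envelope count in Python. It rests on a simplifying assumption of my own: an "analysis" is either a pairwise comparison of two variables, or a regression with one outcome and any non-empty subset of the remaining variables as predictors. The function name is hypothetical.

```python
from math import comb


def count_regression_specs(n_vars):
    """Count regressions: one outcome, any non-empty subset of the other variables."""
    return n_vars * (2 ** (n_vars - 1) - 1)


n = 100
pairwise = comb(n, 2)                    # distinct pairs of variables
regressions = count_regression_specs(n)  # outcome plus predictor-set choices

print("pairwise comparisons:", pairwise)
print("possible regression specifications:", regressions)
print("digits in that count:", len(str(regressions)))
```

Even under this narrow definition, the count of specifications grows exponentially with the number of variables, long before interactions or transformations are considered.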
Thursday, February 02, 2012
Theory of Everything?
A new journal in biology called Life has published an unusual article in its inaugural edition: a paper by Erik Andrulis titled the "Theory of the Origin, Evolution, and Nature of Life." You can find the paper here. At 105 pages and 800 references, his paper seems Sokal-like, except it apparently is not a hoax at all. As a result this paper is unusual, but especially so for two reasons: first, Andrulis is apparently a well-respected biological scientist who has done important work on RNA, and second, Life appears to have all the trappings of a well-respected, peer-reviewed scientific journal, including a well-respected editorial board.
In essence, Andrulis outlines a theoretical framework that (supposedly) unifies the microcosmic and macrocosmic realms, validates predicted laws of nature, and explains the origin and evolution of cellular life. Like most non-biologists encountering this paper, I've only skimmed it, but apparently reality consists of geometric entities called "gyres." Sounds good, except Andrulis provides no evidence (as far as I can tell) that these gyres exist.
It's easy to criticize this paper, if only for the ambition of its theory. In one section he purports to unify all laws of nature, while in another he addresses the meaning of life. Even more astounding is the offhand way he presents his theory. For example, on page 55, Andrulis briefly remarks: "Please note the unity of reality and life as revealed by this theory." Can the unity of reality even be "noted"? However, my favorite part of the paper is on page 61, simply because of the sheer grandiosity of his assertion: "I refer the reader to the Theory section for a complete presentation of theoretical answers to many of science’s most challenging questions."
Questions abound about how this paper was published despite peer review (perhaps it was a publicity stunt for the journal), and about the sanity of Erik Andrulis. From the position of a sociologist of culture, however, the more interesting question concerns why this paper was so heavily criticized, and whether or not papers such as these have a place in scientific journals. Andrulis' paper, I suspect, is filled with flaws and inconsistencies, but I contend there is often insight from theoretical frameworks that we "know" are generally wrong. Thus, the problem, from my perspective, is not that Andrulis wrote this paper, but rather that there is not a biological journal (to my knowledge) where scientists can publish speculative or half-formed theories that are probably "wrong" but nonetheless help us think about the world in a different way. (Sociology, in contrast, in part because of our methodological pluralism and historical connections with philosophy, has a number of journals in which theories, even those that are highly speculative, can be developed and publicized.)
Saturday, January 21, 2012
Murray on Cultural Inequality
The conservative sociologist Charles Murray has written a new book on cultural inequality, and he's written about his main arguments here in the Wall Street Journal. There are two glaring problems with his argument, however. First, although I appreciate his attempts to examine cultural factors of the economy, he frequently conflates behaviors with culture (which consists of values, attitudes, and beliefs, not behaviors arising from these symbolic constructs). This muddles his argument, and leads to a profusion of ad hoc claims that are weakly supported by the data, if at all. Second, his explanation for cultural inequality falls short: in particular, he ignores how lack of public investments and conservative economic policies (for example, lack of investment in public transportation, public spaces, universal welfare systems, and the growth of car-based urban sprawl based on the profit-making concerns of private developers, among other things) are leading causes of the cultural fragmentation he is concerned about.
Thursday, January 12, 2012
Inequality versus Dispersion
I'm glad to see that Alan Krueger, chairman of the Council of Economic Advisers (a fancy name for a panel of three economists), discussed the problems with inequality in his address today. You can find his remarks and graphs here. I liked his graphs, and he shows convincingly many of the standard findings in sociology and political science on politics and inequality in the United States. However, I found the following comments puzzling:
Although I have done much research in my career on inequality, I used to have an aversion to using the term inequality. The Wall Street Journal ran an article in the mid-1990s that noted that I prefer to use the term “dispersion.” But the rise in income dispersion – along so many dimensions – has gotten to be so high, that I now think that inequality is a more appropriate term.
The mixing of the statistical concept of dispersion with the sociological concept of inequality muddles the discussion. It's true that any distribution is often described by some measure of dispersion (e.g., standard deviation) and central tendency (e.g., mean or mode). But inequality encompasses a concept of equity, as well as some concept of disparity (or disparities), neither of which is analogous to the statistical concept of dispersion. Moreover, if we use Krueger's logic it's unclear at what threshold "dispersion" is labeled "inequality"; for instance, his comments imply that Sweden currently has dispersion, while the United States has inequality, although many Swedes would probably disagree.
Tuesday, January 03, 2012
Congratulations to the Digging into Data Recipients
The round two award recipients for the 2011 Digging into Data challenge are listed here.