Monday, April 21, 2014

It's Official: Statistics is "Sexy"

If you've seen the new Captain America movie, you might have noticed that statistics (and data mining more generally) features prominently in the film. I can't imagine a more remarkable shift in the perception of statistics, which has historically been dismissed as "dull" or "boring" (a view that is at odds -- pun intended -- with that of any practicing statistician, past or present). In fact, in a 1998 talk the statistician C.F. Jeff Wu even argued that "statistics" should be replaced with the phrase "data science," in part to remove the negative connotations attached to data analysis and statistical theory!

Yet more and more people are now realizing that statistics is "hot," as exemplified in the following clip, in which Scarlett Johansson's character suggests that the superhero Captain America go on a date with -- yes! -- a statistician:


And of course, if a movie trailer isn't enough to convince you that the public perception of statistics has been shifting, I refer you to the Chief Economist at Google, Hal Varian, who has been saying (correctly) for years that statistics is the "sexy" dream job of the 2010s:

Sunday, April 20, 2014

Python in R: Examples

How do you call Python from R on Windows? This is a project I'd like to dedicate myself to once I have more time, since every native R user would love to have at least pseudo-connectivity with Python.

Here's a short overview of how to run some Python code in R:

# (1) basic Python commands called from R (note the Python 2 print syntax)
system('python -c "a = 2 + 2; print a"')
system('python -c "a = \'hello world\'; print a; import pandas"')

# (2) if you have a python file you've already created (which I've 
# referred to as "my.py"), then you can run it in R as follows:
system("python C:\\Users\\Name\\Desktop\\my.py")

# or alternatively, append the file's directory to Python's search path and
# import it as a module (the whole command must stay on a single line):
system('python -c "import sys; sys.path.append(\'C:/Users/Name/Desktop\'); import my"')
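
To get Python's output back into R rather than just printed to the console, system() can capture standard output when called with intern = TRUE. A minimal sketch (again assuming Python 2):

# capture python's stdout as a character vector in R
out <- system('python -c "print 4 * 4"', intern = TRUE)
as.numeric(out)  # [1] 16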

Saturday, April 19, 2014

Class-Conditional Response Probabilities

One issue that I've been trying to resolve is how to manually graph the class-conditional response probabilities estimated by the 'poLCA' package in R. I've figured out one approach using the code below:
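
(The code assumes a fitted latent class model named lc. As a minimal sketch, you could produce one from poLCA's bundled 'values' data; the formula and the number of classes here are purely illustrative:)

# fit an illustrative two-class model with poLCA
library(poLCA)
data(values)
f <- cbind(A, B, C, D) ~ 1
lc <- poLCA(f, values, nclass = 2)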

# extracting response probabilities
probs <- lc$probs                     # list of response-probability matrices, one per item
R <- length(lc$P)                     # number of latent classes
pi.class <- matrix(NA, nrow = length(probs), ncol = R)
for (j in 1:length(probs))
  pi.class[j, ] <- probs[[j]][, 1]    # probability for the first category
dimnames(pi.class) <- list(names(probs), round(lc$P, 2))
# if you want to specify your own rownames: rownames(pi.class) <- row.names

# extracting standard errors
probs.se <- lc$probs.se               # list of standard-error matrices, one per item
se.class <- matrix(NA, nrow = length(probs.se), ncol = R)
for (j in 1:length(probs.se))
  se.class[j, ] <- probs.se[[j]][, 1] # SE for the first category
dimnames(se.class) <- list(names(probs.se), round(lc$P, 2))
# if you want to specify your own rownames: rownames(se.class) <- row.names

## creating an augmented dataset
# class-conditional probabilities and standard errors
df.probs <- data.frame(Classes = as.vector(col(pi.class)),
                       Manifest.variables = as.vector(row(pi.class)),
                       value = as.vector(pi.class),
                       names = rownames(pi.class),
                       se = as.vector(se.class))

## (1) LINE PLOT (No Std. Errors): line plot of latent classes
win.graph()
p <- ggplot(df.probs, aes(x = factor(Manifest.variables), y = value,
                          color = factor(Classes)))
p + geom_freqpoly(stat="identity",aes(group=Classes)) + # NB: stat="identity" plots the probabilities as-is
  geom_point(stat="identity",aes(group=Classes)) +
  scale_color_hue(name="Latent Class") + xlab("Manifest Variables") +
  ylab('P(Y = "Too Little")') +
  ggtitle("Class-Conditional Response Probabilities by Latent Class") +
  theme_bw() + scale_x_discrete(labels=unique(df.probs$names)) +
  coord_flip()
  # to add variable names manually for the manifest variables:
  # + scale_x_discrete(labels=c(""))
dev.off()

## (2) RIBBON PLOT (has Std. Errors): ribbon plot of response probabilities
# (with standard errors) using ggplot to graph the predicted probabilities

df.probs$lower <- df.probs$value - df.probs$se
df.probs$upper <- df.probs$value + df.probs$se
df.probs$Classes <- factor(df.probs$Classes)
# using ggplot to graph the predicted probabilities
win.graph()
ggplot(df.probs, aes(x = Manifest.variables, y = value, group=Classes)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill=Classes),
              alpha = 0.2) +
  geom_line(aes(colour = Classes), size = 1) + theme_bw() +
  ggtitle("Class-Conditional Response Probabilities by Latent Class") +
  xlab("Manifest Variables") +   ylab('P(Y = "Too Little")') +
  scale_fill_discrete("Latent Class") +
  scale_linetype_discrete("Latent Class") +
  scale_shape_discrete("Latent Class") +
  scale_colour_discrete("Latent Class") +
  scale_x_discrete(labels=unique(df.probs$names)) + coord_flip()
  # to add variable names manually for the manifest variables:
  # + scale_x_discrete(labels=c(""))
dev.off()

Wednesday, April 09, 2014

Twitter Extraction

For several weeks I've been working on analyzing tweets in R. Here's one approach to examining Twitter feeds:
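
(The code below assumes a data frame of tweets named df. As a rough sketch, one way to build it is with the twitteR package -- assuming you've already authenticated with the Twitter API -- where the search term is just an example:)

# gather tweets and convert the results to a data frame
library(twitteR)
tweets <- searchTwitter("#rstats", n = 500)
df <- twListToDF(tweets)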

# see how many unique Twitter accounts in the sample
length(unique(df$screenName))

# Create a new column of random numbers in place of the usernames and redraw the plots
# find out how many random numbers we need
n <- length(unique(df$screenName))
# generate a vector of random numbers to replace the names (four digits just for convenience)
randuser <- round(runif(n, 1000, 9999), 0)
# match up a random number to a username
screenName <- sapply(unique(df$screenName), as.character)
randuser <- cbind(randuser, screenName)
# Now merge the random numbers with the rest of the Twitter data, and match up the correct
# random numbers with multiple instances of the usernames:
rand.df <- merge(randuser, df, by = "screenName")
# determine the frequency of tweets per account
counts <- table(rand.df$randuser)
# create an ordered data frame for further manipulation and plotting
countsSort <- data.frame(user = names(sort(counts, decreasing = TRUE)),
                         count = as.numeric(sort(counts, decreasing = TRUE)),
                         row.names = NULL)  # sorting both columns keeps users aligned with their counts

# create a subset of accounts that tweeted more than five times
countsSortSubset <- subset(countsSort, count > 5)

## extract counts of how many tweets from each account were retweeted
# (1) clean the twitter messages by removing odd characters
rand.df$text <- sapply(rand.df$text, function(row) iconv(row, to = "UTF-8"))
# (2) remove @ symbol from user names
trim <- function(x) sub("@", "", x)
# (3) pull out who the message is to
rand.df$to <- sapply(rand.df$text, function(name) trim(name))
# (4) extract who has been retweeted (str_match comes from the stringr package)
library(stringr)
rand.df$rt <- sapply(rand.df$text, function(tweet)
  trim(str_match(tweet, "^RT (@[[:alnum:]_]*)")[2]))

# (5) replace names with corresponding anonymising number
randuser <- data.frame(randuser)
rand.df$rt.rand <- as.character(randuser$randuser)[match(as.character(rand.df$rt),
                                                         as.character(randuser$screenName))]

# (6) make a table with anonymised IDs and number of RTs for each account
countRT <- table(rand.df$rt.rand)
countRTSort <- sort(countRT)
# (7) subset those accounts retweeted more than twice
countRTSortSubset <- subset(countRTSort, countRTSort > 2)

# (8) create a data frame for plotting
countRTSortSubset.df <- data.frame(user = as.factor(unlist(dimnames(countRTSortSubset))),
                                   RT_count = as.numeric(unlist(countRTSortSubset)))

# (9) combine tweet and retweet counts into one data frame
countUser <- merge(randuser, countsSortSubset, by.x = "randuser", by.y = "user")
TweetRetweet <- merge(countUser, countRTSortSubset.df,
                      by.x = "randuser", by.y = "user", all.x = TRUE)

# (10) create a random subset for the graph below
TweetRetweet.sub <- TweetRetweet[sample(nrow(TweetRetweet), 50), ]  # 50 rows is an illustrative choice
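
Here's a minimal sketch of the kind of graph that subset could feed, using ggplot2 (the axis labels and styling are my own choices):

# plot tweet counts against retweet counts for the sampled accounts
library(ggplot2)
ggplot(TweetRetweet.sub, aes(x = count, y = RT_count)) +
  geom_point() +
  xlab("Tweets per account") + ylab("Retweets per account") +
  theme_bw()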

Friday, September 13, 2013

The Troubled Future of Higher Education

The political scientists Gary King and Maya Sen have just posted an excellent working paper clearly outlining the major problems facing higher education: economic, political, and sociological. The main thrust is that, although only 30% of the American population obtains a four-year college degree (leaving an untapped 70% who could still finish college degrees), the higher education system faces major constraints due to limited budgets and major technological advances. For example, online sites such as Khan Academy are effectively competing with universities, and for-profit universities are growing rapidly. I'd add to their list the potential for big data analysis to displace the role of experts; I point to the effect of sabermetrics on baseball journalists, or of data mining algorithms on marketers, as possible canaries in the coal mine for academics. Regardless, King and Sen's paper is a much-needed beginning of a discussion about the future of higher education in the wake of profound social changes. After all, it was only a decade ago that Time and Newsweek were major cultural institutions in American life.

Thursday, May 31, 2012

Using Indirect Survey Techniques to Measure Zombie Outbreaks?

Zombies are now a common topic of discussion. In fact, the data we have available from Google Trends (for the phrase "zombie attack") strongly suggest an increasing risk of zombification across the world:


However, academic research on zombies is limited (i.e., non-existent), mainly because of the lack of high-quality data. For those interested in studying zombies, I refer readers to Andrew Gelman's paper (co-written, apparently, by the great zombie film director George Romero) on how to measure zombie outbreaks via indirect survey techniques. You can find his article here. Even if you're not interested in zombies, his paper offers some good ideas on how to sample difficult-to-reach populations more generally.

Friday, May 11, 2012

The Promising Future of Mathematical Sociology

I'm now an occasional blogger at Permutations, the official blog of the Mathematical Sociology Section of the American Sociological Association. You can read my blog post here, in which I outline why I think global trends in information technology and the meta-theoretical foundations of sociology provide the conditions for a promising future for sociology in general and mathematical sociology in particular.

Thursday, May 10, 2012

90+ Two-Minute Videos on R

I highly recommend Anthony Damico's excellent two-minute videos on programming in R. You can find the full list of 90+ videos here. This is the first of the series, which tells you how to download and install R:


More generally, Anthony's video collection is another reminder of the immense sociological benefits that come from sharing educational materials and expert knowledge in the style of the Khan Academy.

Tuesday, May 08, 2012

Global Online Conference on Statistics

The Consortium for the Advancement of Undergraduate Statistics Education is hosting a global online conference titled "eCOTS: Electronic Conference on Teaching Statistics." You can view the full program here. It costs only $15 to register and participate in the online conference. For at least the past five years I've thought that conferences are obsolete in many respects, so I'm delighted to see this kind of conference being developed. Without a physical venue -- and the food, beverages, equipment, lodging, and transportation costs that come with one -- attendance is much cheaper, enabling more and more people to learn and contribute to knowledge production. (Of course, we'll still want some conferences for face-to-face socialization!)

Sunday, May 06, 2012

I've Converted to R Full-Time

For over four years I used both R and Stata, but as of last week I've become an R convert. For several years I had conducted statistical analyses in R (since many complex models can only be programmed in R), but I used Stata before and after the analyses. In essence, I'd merge and clean data sets in Stata, call R from Stata for the statistical analyses, export R objects back into Stata, and then use Stata's graphics utilities to display the results. This setup quickly unraveled last month when I began merging and recoding data in R, a task made much easier by John Fox's fantastic "car" package.

The problem is that if you want to do Bayesian analysis or graph modeled coefficients (or work with complex data structures more generally), then R is much easier than Stata thanks to its object-oriented programming environment. It's unbelievably liberating to be able to save vectors, matrices, data frames, and so on from multiple data sources and manipulations in the same conceptual space. Additionally, R has fantastic graphics capabilities (3-D plots, rotating hyperplanes, social network graphs, and so on), offers excellent tools for analyzing and displaying so-called big data (for example, check out the "tabplot" package), and is (frankly) a fun, intuitive programming language. If you need additional reasons to be an R convert, keep in mind that R is completely free, open-source, and extensible, with over 5,300 statistical packages (as of April 2012).
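
If you'd like to see one of those big-data display tools in action, here's a minimal sketch using the "tabplot" package, with ggplot2's diamonds data standing in for a large data set:

# a tableplot bins the rows and displays one column per variable
library(tabplot)
data(diamonds, package = "ggplot2")
tableplot(diamonds)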