Big data is a – perhaps the – hot topic in education. Over the summer I attended two conferences with presentations on big data, data mining and predictive analytic applications in higher education. This Fall I’ll be taking a MOOC on Big Data in Higher Education. My early take is that there is some big data medicine for education but it’s mixed in with plenty of snake oil.
One issue is that big data, data mining, and predictive analytics mean different things to different people. Big data most commonly refers to datasets too large for analysis with standard statistical and analytical tools; in practice this generally means data with sizes measured in terabytes or petabytes. One example of this kind of data is the browsing patterns recorded by online retailers, which are then used to generate purchase suggestions for individual customers. When every click on a widely used website – think Amazon.com – becomes a data point, the petabytes add up quickly.
Data mining is a term for the various techniques used to analyze these large datasets, and predictive analytics is best thought of as the subset of data mining focused on making predictions. Many data mining techniques developed for analyzing big data are also useful for data that is less big but very complex. This is where big data comes into higher education, where we tend to have moderately sized datasets representing very complex behavior.
Big Claims for Big Data
Big data promoters have made big claims. Perhaps the biggest, made by Chris Anderson in Wired magazine and echoed by George Siemens at one of the conferences that I attended, is that big data allows researchers to abandon theory, scientific method, and a concern with causation. As Anderson puts it:
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
This position fails to differentiate between description, prediction, and intervention as objectives of inquiry. The distinction matters because the importance of theory, scientific method, and causation differs across these objectives.
Some of Anderson’s examples of the power of big data and data mining are in fact descriptive. He makes a great deal of Craig Venter and the sequencing of the human genome.
The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.
The sequencing of the human genome was a tremendous scientific achievement, and Venter’s use of shotgun sequencing was bold and unconventional and did capitalize on big data analytics. However, it was always a fundamentally descriptive project. What Venter’s work points to is the continued importance of description without causation, but this is not, in fact, particularly new. Description without causation or scientific method has a long history and a vibrant present. Computers make it faster, and definitely make new kinds of description possible, but this is not a fundamentally new way of doing science. Moreover, using the recently described human genome to do things like build new drugs or gene therapies will require continued reliance on experimentation, scientific method, theory, and a concern with causation over correlation.
Education needs better descriptive analytics. Consider enrollment as an example. At the community college we know that the idea of being a “two year” college is something of a fiction. A very small fraction of our students go full time for two years with no stopping out. While we know that this “ideal” does not reflect reality, we continue to report graduation rates (among other things) as if the two year fiction were reality. We know it’s a fiction, but we don’t have a better approximation to reality. Getting a better approximation is difficult because our students follow so many different paths through our institutions. Big data analytics of the kind used by Peter Crosta may help. I am hoping to take the big data plunge by replicating this study at my own institution.
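As a concrete illustration of what such descriptive work involves, here is a minimal sketch in Python: cluster students by their term-by-term credit loads and then describe each cluster. The file name, column names, and choice of five clusters are hypothetical placeholders, and Crosta’s actual methods are more sophisticated than this.

```python
# Minimal sketch of descriptive enrollment-path analysis. The file
# name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical input: one row per student per term, with the number
# of credits attempted (0 if the student sat the term out).
records = pd.read_csv("enrollments.csv")

# Pivot to one row per student: credit load in each of six terms.
paths = records.pivot_table(index="student_id", columns="term",
                            values="credits", fill_value=0)

# Cluster students by the shape of their enrollment path.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
paths["cluster"] = kmeans.fit_predict(paths)

# Describe each cluster: average credit load per term.
print(paths.groupby("cluster").mean().round(1))
```

Even this toy version would replace the two-year fiction with a handful of empirically grounded pathways, which is the descriptive payoff I am after.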
Correlation is to causation as prediction is to intervention
Many proponents of predictive analytics seem to assume that interventions follow naturally and obviously from predictions. They don’t. Predictive analytics may indicate which students are likely to struggle but that isn’t the same as, and may be a long way from, knowing what to do about it. Intervention is actually far more difficult than prediction.
One big data application that I heard cited more than once by big data promoters was the product suggestions made by Amazon and Netflix. Both sites use customers’ viewing and past purchasing data to suggest additional items. This kind of analysis, commonly held up as an example of the power of big data, exposes the prediction/intervention problem.
When Amazon predicts that I will like a product, they suggest it to me. In other words, they suggest that I do something that their data indicate I am predisposed to do. In education, when our data predict that a student will struggle, we want to intervene in a way that changes the outcome. We want our students to do something that they are predisposed not to do. It is as if, having discovered that I am predisposed to want to buy Fifty Shades of Grey, Amazon intervenes and gets me to buy Pilgrim’s Progress instead. My concern with big data in education is that the model that figures out what I want won’t necessarily shed light on how to change what I want, and that is what we need in education. Predicting an outcome isn’t the same as intervening to change that outcome, and interventions that attack correlates rather than causes are not likely to succeed.
Data, data everywhere, but first let’s stop and think
When you ask a big data person what data to collect, they frequently say something like “All of it.” That response implicitly assumes that all data are equally valid, equally costly, and equally available. Those assumptions are demonstrably false. There is a substantial danger in analyzing only the data you have without giving thought to the data that you need. In addition, data collection is a theoretically driven activity: we collect some data and not other data because theories tell us that some data are relevant and others are not.
The anti-theory approach of big data promoters leads quickly to the false assumption that any data will do or that whatever data you have is good enough. “All of the data” comes to mean not “all of the data that is relevant” but “all of the data that you have.” We should worry about this a lot, because the data we have tends to be the data that is convenient to collect. This compounds the prediction/intervention problem that I described above.
The importance of having the correct data was evident at both conferences that I attended this summer. Each featured a practical demonstration of the power of big data analytics in higher education; both demonstrations included high school GPA in their prediction models, and in both cases it was a very important predictor. Unfortunately, my institution does not collect that data. All of the analytical models presented will run and produce results without GPA, but none of them will tell me that my results would be more useful if I had GPA data. That is a conclusion from theory.
A Case Study
The well-chosen case is a standard rhetorical technique. A very frequently mentioned example of the power of predictive analytics in education is Rio Salado College. I have heard, read, and been told that within eight days from the start of term Rio Salado can “predict with 70% accuracy students who _____________.” What fills in the blank varies from “were at high risk not to be successful in a course” to “score a C or better” to “drop a course.”
The visual data expert Edward Tufte once wrote: “At the heart of quantitative reasoning is a single question: Compared to what?” When I mentioned these data to a colleague, she was quite confident in her ability to divide her students into three groups after eight days with similar predictive power. What is the actual value added to the instructor over and above their regular interactions with the student?
What factors has Rio Salado discovered contribute to student success?
As we crunched data from tens of thousands of students, we found that there are three main predictors of success: the frequency of a student logging into a course; site engagement–whether they read or engage with the course materials online and do practice exercises and so forth; and how many points they are getting on their assignments. (source)
In other words:
- Show up to class (logging into a course)
- Do the assigned work (read or engage with the course)
- Do well on assignments (points on assignments)
Sound advice but hardly novel and, to my mind, a low return on data crunched from tens of thousands of students.
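Tufte’s question can be made concrete. Before crediting a model with “70% accuracy,” it is worth computing what a trivial rule achieves. Here is a minimal sketch; the 54% base success rate is an assumption chosen only for illustration, not Rio Salado’s actual figure.

```python
# Tufte's "Compared to what?" made concrete: check what a trivial
# rule achieves before crediting a model with "70% accuracy."
# The base rate below is an assumption for illustration only.

base_success_rate = 0.54          # hypothetical: 54% of students succeed

# Trivial rule 1: predict "success" for every student
always_success_acc = base_success_rate

# Trivial rule 2: always predict the majority class, whichever it is
majority_acc = max(base_success_rate, 1 - base_success_rate)

claimed_model_acc = 0.70          # the figure quoted for Rio Salado

print(f"Always predict success: {always_success_acc:.0%}")
print(f"Majority-class rule:    {majority_acc:.0%}")
print(f"Claimed model accuracy: {claimed_model_acc:.0%}")
print(f"Value added: {100 * (claimed_model_acc - majority_acc):.0f} percentage points")
```

On assumptions like these, the model’s value added over guessing the majority class is a handful of percentage points, which is the comparison the headline number never makes.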
From Prediction to Intervention
All of this would be fine if Rio Salado were able to turn these findings into substantial interventions that produce success. So, given their findings, what did they do to improve success?
Early data showed students in general-education courses who log in on Day 1 of class succeed 21 percent more often than those who don’t. So Rio Salado blasted welcome e-mails to students the night before courses began, encouraging them to log in. (source)
This example nicely encapsulates my skepticism. Does logging in on Day 1 really cause success, or is wanting to log in on Day 1 the product of some unobserved set of characteristics that produce success? If we nag people into logging in on Day 1, can we realistically expect them to perform in the same way as the unnagged student? Is there any theoretical reason to think that logging in early is related to success? Most importantly, what is the causal mechanism by which logging on early produces success? In other words, how does logging on early lead to success? Without answers to those questions, focusing on getting students to log in more often is likely to be a waste of time.
I would be more comfortable with the correlational emphasis among big data proponents if they showed more concern with evaluating their correlational data through the lens of, at least, plausible causal mechanisms. Unfortunately:
The hope, [a Rio Salado instructor] says, is that a yellow signal might prompt students to say to themselves: “Gosh, I’m only spending five hours a week in this course. Obviously students who have taken this course before me and were successful were spending more time. So maybe I need to adjust my schedule.” (source)
Perhaps. But even if the difficulties of community college students do result from their own lack of knowledge about what they need to do (show up, do the work, and do well on the work), the expectation that they will immediately reach this conclusion is, to my mind, optimistic. And the assumption that they can “adjust my schedule” with ease, when we know those schedules often involve young children and work, is unrealistic. Frankly, if the intervention that we anticipate developing from predictive analytics is simply to inform students about what we have found, we should probably stop now.
A Modest Proposal
I do see one very important possibility for predictive analytics at Rio Salado. Currently they can make accurate predictions with eight days of data. If they could push that back to six days, they could make their predictions prior to the deadline for a full refund, which could conceivably save students a great deal of money and preserve their financial aid eligibility. That, I think, might contribute a great deal towards success.
Students who enrolled in X . . .
The suggestions made by Amazon and Netflix already have an educational correlate in the Degree Compass system in use at Austin Peay. This system:
Uses predictive analytics techniques based on grade and enrollment data to rank courses according to factors that measure how well each course might help the student progress through their program. From the courses that apply directly to the student’s program of study, the system selects those courses that fit best with the sequence of courses in their degree and are the most central to the university curriculum as whole. That ranking is then overlaid with a model that predicts which courses the student will achieve their best grades. In this way the system most strongly recommends a course which is necessary for a student to graduate, core to the university curriculum and their major, and in which the student is expected to succeed academically. (source)
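Read as an algorithm, that description implies a composite score: filter to program-applicable courses, score them on sequence fit and curricular centrality, then overlay the predicted grade. Here is a speculative sketch of what such a ranking might look like; the fields, weights, and example numbers are entirely my invention, not Degree Compass’s actual model.

```python
# Speculative sketch of the kind of ranking the quoted description
# implies. The fields, weights, and example numbers are invented;
# this is not Degree Compass's actual model.
from dataclasses import dataclass

@dataclass
class Course:
    name: str
    in_program: bool        # applies to the student's program of study
    sequence_fit: float     # 0-1: fit with the degree's course sequence
    centrality: float       # 0-1: centrality to the curriculum as a whole
    predicted_grade: float  # 0-4: predicted grade for this student

def rank_courses(courses, grade_weight=0.5):
    """Filter to program-applicable courses, score on fit and
    centrality, then overlay the predicted-grade model."""
    eligible = [c for c in courses if c.in_program]
    def score(c):
        need = (c.sequence_fit + c.centrality) / 2
        return (1 - grade_weight) * need + grade_weight * (c.predicted_grade / 4)
    return sorted(eligible, key=score, reverse=True)

courses = [
    Course("MATH 1010", True, 0.9, 0.8, 2.1),
    Course("ENGL 1010", True, 0.8, 0.9, 3.4),
    Course("ART 2000", False, 0.2, 0.3, 3.9),
]
for course in rank_courses(courses):
    print(course.name)
```

Note that with any positive grade weight, a course in which the student is predicted to do poorly sinks in the ranking even when it is central to the degree – which is exactly what the questions below are about.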
The use of grades raises a number of questions such as:
- How, if at all, does the program account for variation in grades across professors teaching the same class or across disciplines?
- If it does include this variation, will this system drive grade inflation as it suggests teachers, classes and disciplines with higher average grades?
- How will this system affect aggregate enrollment? 
- Should we discourage students from taking classes just because they might get a poor grade? How does this fit with concerns about educational standards?
- Does this approach encourage specialization thereby automating a retreat from a liberal arts approach to college education?
- How does the system deal with students whose program choice leads to a low prediction of success? Put differently, is the next step to suggest degrees, certificates and courses of study? If so, is this the first step in automating what Burton Clark called the cooling out function?
Automated course suggestions might be worthwhile regardless of the answers to these questions. My concern is that these questions aren’t addressed in discussions of big data applications.
Don’t Believe the Hype
I’ll finish with a small observation that demonstrates, at least to me, the ways in which “revolutions” repackage and take credit for what already exists. After some minutes sermonizing on the fundamental changes wrought by big data, the cloud, and the application of predictive analytic techniques, the Amazon Web Services evangelist gave the example of using analytics to discover that “fulfillment center” workers could move more efficiently in the warehouse, substituting two steps for the ten they had previously taken, thereby improving workers’ lives and productivity.
Whatever the value of such findings – and regardless of who benefits from them – they are unrelated to big data, predictive analytics, the web, the information revolution, or even the computer. Scientific management, Taylorism, and time-motion study are more than one hundred years old and features of the industrial assembly line rather than the information-age supply chain.
Whether the assembly line and scientific management are good models for education and, perhaps more importantly, whether the “information revolution”, “big data”, and “predictive analytics” are stalking horses for the scientific management of education are separate questions about which people of good will can disagree. What we should all recognize and agree on is that we should not throw theoretical, causal, and experimental babies out with the big data bathwater.
A comparison might make the differences of scale clear. At a very large college, 30,000 students might enroll in 12 courses a year, for a total of 360,000 selections. If we had 100 pieces of information on each enrollment we would have 36 million data points. This is a lot of data, but it isn’t “big” data. In contrast, Google processes about a billion searches a day; multiply that by 100 pieces of information and you have big data. The process – people making choices that are recorded – is similar, but the scale is very different. What is true in education is that the complexity of the data is very high. We have an ideal-typical conception in which students at four-year colleges attend for eight consecutive full-time semesters (or 12 quarters), but this ideal type describes hardly any real college students and very few community college students. Enrollment data isn’t big, but neither is it simple.
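For the curious, the back-of-the-envelope arithmetic, with the Google figure taken as a rough public estimate rather than a precise count:

```python
# The footnote's arithmetic. The Google figure is a rough public
# estimate, not a precise count.
students, courses_per_year, fields = 30_000, 12, 100
college = students * courses_per_year * fields
print(f"Large college: {college:,} data points per year")  # 36,000,000

searches_per_day = 1_000_000_000
google = searches_per_day * fields
print(f"Google: {google:,} data points per day")           # 100,000,000,000
```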
For example, economists’ modest ability to predict economic trends dwarfs their ability to change economic trends. Put differently, predicting that a company will not be profitable is far easier than intervening to make that company profitable.
According to Rio Salado’s accreditation self-study, they used a naïve Bayes classification method* to divide students into high, moderate, and low risk groups. The college then found that “The mean success rate was approximately 70% in the Low warning group, 54% in the Moderate warning group, and 34% in the High warning group.”
* If you aren’t familiar with Bayesian methods Nate Silver’s new book The Signal and the Noise provides an excellent non-technical discussion.
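For readers who want to see the mechanics, here is a minimal sketch of a naïve Bayes risk classifier along these lines, using scikit-learn. The features (logins, pages viewed, points) follow the three predictors Rio Salado reports, but the data, labels, and scales are invented for illustration; this is not Rio Salado’s actual model.

```python
# Minimal sketch of a naive Bayes risk classifier. The feature
# values, labels, and scales are invented for illustration; they
# are not Rio Salado's actual data or model.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical features per student after eight days:
# [logins, pages viewed, points earned so far]
X_train = np.array([
    [12, 80, 45], [10, 60, 40],   # low-warning students
    [6, 30, 20],  [7, 45, 25],    # moderate-warning students
    [2, 5, 0],    [1, 10, 5],     # high-warning students
])
# Labels: 0 = low, 1 = moderate, 2 = high warning
y_train = np.array([0, 0, 1, 1, 2, 2])

model = GaussianNB()
model.fit(X_train, y_train)

new_student = np.array([[3, 12, 8]])
print(model.predict(new_student))        # predicted warning group
print(model.predict_proba(new_student))  # probability for each group
```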
Envisioning Information, p. 67.
Rio Salado is largely online, so the interactions between students and teachers are different, but in some sense this technical revolution is only solving problems of its own making. The need to analyze data arises from the lack of direct contact between students and teachers.
Given all the computing power involved in big data applications, it should be possible to simulate future enrollment based on students taking Degree Compass suggestions. I would be interested in seeing that simulation.
 We may or may not want to retreat in this manner, but we shouldn’t do it through a technological default.
 I wrote my master’s thesis on the politics at the Ford fulfillment center featured in this video.