Reflections on the application of big data: Is big data omnipotent? Do not be superstitious about big data!
IBM defines big data as a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. The characteristics of big data include high volume, high velocity, and high variety. As the application of big data becomes more and more widespread, the main challenges in big data analysis are the quality and reliability of the data.
Big data as a disruptive method
In the era of “Web 3.0”, big data has become fashionable. With the help of data, researchers can monitor the spread of Covid-19 and financial traders can make decisions more scientifically. Meanwhile, data inform what we know about the universe, and they help indicate what is happening to the earth’s climate (Gitelman, 2013).
What’s more, as an interdisciplinary approach, big data computation is overturning the way scientific research is done. A study found that 59% of the studies examined used quantitative methods (Peng et al., 2013). A new trend in quantitative Internet Studies is big data analytics, which focuses on collecting large amounts of data from social media platforms and analysing it in a predominantly quantitative manner (Fuchs, 2017). For example, computational advertising, a decision-making model based on algorithms and computation, takes into account the context and users’ characteristics. Through big data analysis and algorithmic calculation, it is possible to infer how effective an advertisement is in just a few minutes. This completely overturns the traditional methods of researching advertising effects.
Is big data omnipotent?
However, is big data omnipotent, or is it just a utopia? We often think that numbers are objective and that they can reflect phenomena and uncover the essence of things comprehensively, so that a result reached through the analysis of numbers will be the most scientific. But reality often deals us a heavy blow. Working with big data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth (Boyd & Crawford, 2012).
Technology companies like Google have announced that, through data collection and modeling, they can predict a movie’s first-week box-office revenue one month before its release with an accuracy of 94%. For the movie “The Continent” (2014), directed by the young Chinese director Han Han, a similar data forecast service was purchased from ABD, a leader in China’s big data prediction industry. But there was a large deviation between the forecast and the movie’s actual market performance: the model estimated a box-office value of 430 million to 480 million, while the actual box-office value exceeded 600 million.
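To put the size of that miss in perspective, here is a minimal sketch of the arithmetic. It assumes 600 million as a floor for the actual result, since the figure above is only given as "exceeded 600 million", so the real shortfall is at least this large. The function name `relative_error` is my own.

```python
def relative_error(forecast, actual):
    """Signed forecast error as a fraction of the actual value."""
    return (forecast - actual) / actual

# Forecast range and actual result as quoted in the text (same currency unit).
forecast_low, forecast_high = 430e6, 480e6
actual_floor = 600e6  # assumption: "exceeded 600 million" is treated as a lower bound

best_case = relative_error(forecast_high, actual_floor)   # top of the forecast range
worst_case = relative_error(forecast_low, actual_floor)   # bottom of the forecast range

print(f"{best_case:.1%} {worst_case:.1%}")  # prints "-20.0% -28.3%"
```

Even the most optimistic end of the forecast falls short by about a fifth, far from the 6% error that a 94% accuracy claim would suggest.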
Big data also promises a multitude of innovative options to enhance decision-making by employing algorithms to gather valuable information out of large unstructured data sets (Strauß, 2015). Against this background, many streaming services such as Netflix try to cater to audience preferences before making a TV series in order to maximize revenue. Netflix analyzed its users’ preferences before filming “House of Cards”, which was a huge success. But its later series did not perform as well in the market. Judging by their Rotten Tomatoes scores, the overall trend is downward.
So why does big data sometimes fail to deliver what we want?
Reasons behind big data failure
Several factors affect the accuracy of big data analysis: inauthentic data collection, information incompleteness and noise of big data, unrepresentativeness, consistency and reliability, and ethical issues (Liu et al., 2016). In this article, I will discuss the following factors.
1. Humans build the model
Though software faults are concerned primarily with process, design, and data (Sharma, Kumar and Kaswan, 2021), behind the hardware and the model it is humans who create the whole analysis system. Unlike a machine, people have their own subjective views and preferences, and both affect how the model is built and which analysis methods are selected. The weights assigned to different factors also affect the final predictions.
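A small sketch can make the point about weights concrete. Everything here is hypothetical: the indicator names, the normalized scores, and the two sets of weights are invented for illustration and do not come from any real forecasting system.

```python
def weighted_forecast(indicators, weights):
    """Combine normalized indicator scores into one forecast score."""
    return sum(indicators[name] * weights[name] for name in indicators)

# Hypothetical, normalized indicator scores (0 to 1) for a single film.
indicators = {"search_volume": 0.9, "prequel_box_office": 0.3, "screen_count": 0.6}

# Two hypothetical analysts who trust different signals; each set of weights sums to 1.
analyst_a = {"search_volume": 0.6, "prequel_box_office": 0.1, "screen_count": 0.3}
analyst_b = {"search_volume": 0.1, "prequel_box_office": 0.6, "screen_count": 0.3}

score_a = weighted_forecast(indicators, analyst_a)  # roughly 0.75
score_b = weighted_forecast(indicators, analyst_b)  # roughly 0.45
print(round(score_a, 2), round(score_b, 2))
```

The data are identical in both cases. Only the human judgment about what matters has changed, and the forecast moves with it.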
2. Errors brought by the applicability of the model
In the Google box-office model mentioned above, the basic factors include the number of searches for the movie and the number of movie advertisements (one week before the movie’s release), the box-office performance of earlier movies in the same series, the number of theaters showing the movie, and the seasonal characteristics of the release schedule. After obtaining these indicators for each movie, Google built a linear regression model to establish the relationship between the indicators and box-office revenue (Wang et al., 2016). This model is indeed applicable to the western film market, especially Hollywood. However, it is not very suitable for China. Gong Yu (2014), CEO of iQiyi, one of the leading movie and video streaming websites in China, pointed out that when the data collected by Baidu and iQiyi are put into Google’s model, the accuracy is very low, which indicates that other factors may be at work in the Chinese market. Obtaining data is not difficult. The difficult part is establishing suitable dimensions of analysis.
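The transfer problem can be sketched with a deliberately simplified regression. This is not Google's model: it uses a single indicator (search volume) instead of several, ordinary least squares written by hand, and made-up numbers for two hypothetical markets in which the same search volume translates into very different revenue.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_pct_error(xs, ys, slope, intercept):
    """Average of |prediction - actual| / actual over all films."""
    return sum(abs(slope * x + intercept - y) / y for x, y in zip(xs, ys)) / len(xs)

# Synthetic data, invented for illustration only.
searches = [10, 20, 30, 40, 50]            # search volume per film
revenue_market_a = [31, 49, 71, 90, 109]   # market where the model is fitted
revenue_market_b = [55, 100, 160, 205, 250]  # market with a different relationship

slope, intercept = fit_line(searches, revenue_market_a)
error_a = mean_abs_pct_error(searches, revenue_market_a, slope, intercept)
error_b = mean_abs_pct_error(searches, revenue_market_b, slope, intercept)
print(f"market A error: {error_a:.1%}, market B error: {error_b:.1%}")
```

The coefficients describe market A well and market B badly, even though the input data are equally easy to collect in both. The indicators, and the relationship between them and revenue, have to be re-established for each market.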
3. Unavailable data and dirty data
Although big data contain information in a finer-grained and more detailed manner, they also record random variations, fluctuations, and even noise during measurement (Liu et al., 2016). Dirty data and noise are always present on the Internet. On social media such as WeChat and Weibo, posts with little emotional content, unclear semantics, or unclear themes may add noise to the captured data. Even traditional sampling surveys inevitably encounter noise. The only way to deal with this problem is to filter out as much noisy data as possible while continuously correcting the model and widening the error range of the predictions.
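As a rough sketch of "filter the noise, then widen the error range", the snippet below uses one common technique, a modified z-score based on the median absolute deviation, to drop extreme points, and then reports a range around the mean whose width is controlled by a `widen` factor. The technique, the threshold of 3.5, the function names, and the sample counts are all my assumptions, not something taken from the sources cited here.

```python
from statistics import mean, median, stdev

def filter_noise(values, threshold=3.5):
    """Drop points whose modified z-score (median absolute deviation based) exceeds threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against, so keep everything
    return [v for v in values if 0.6745 * abs(v - center) / mad <= threshold]

def forecast_range(values, widen=1.0):
    """Mean of the data plus or minus widen times the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return m - widen * s, m + widen * s

# Made-up daily mention counts; 500 stands in for a noisy spike.
daily_mentions = [100, 102, 98, 101, 99, 500]

cleaned = filter_noise(daily_mentions)
narrow = forecast_range(cleaned, widen=1.0)
wide = forecast_range(cleaned, widen=2.0)  # a more cautious, wider error range
print(cleaned, narrow, wide)
```

Filtering removes the spike, and the wider range admits openly that the cleaned data are still uncertain.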
Additionally, data gaps are another problem. No company can promise that it has all the data needed to make predictions. Although cooperation can alleviate this problem, it is still unrealistic to expect every company to break through the data barriers.
Because of excessive reliance on and blind worship of data analysis, click-through fraud has become an unwritten rule in China, resulting in a vicious circle. Since directors and producers care about the data performance of film and television works, fans choose to falsify data to help their idols achieve better data performance and commercial value. Working from this kind of false data, directors and producers cannot make the right decisions even if they use the most accurate models.
Predicting the future is the most fascinating aspect of big data. If we could accurately predict the box office, it would be much easier to win more advertising placements and control the cost of advertising. However, as Galen Panger (2015) said, we have only begun to grapple with the many complexities of research involving big data. There is still a long way to go. Meanwhile, because big data summarizes past experience, it struggles to predict completely new things, so sometimes we should give up our obsession with big data and take a critical attitude towards it.
Boyd, d., & Crawford, K. Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Information, Communication & Society, 2012. 15(5): 662-679. https://doi.org/10.1080/1369118X.2012.678878
Christian Fuchs. From Digital Positivism and Administrative Big Data Analytics towards Critical Digital and Social Media Research! European Journal of Communication, 2017. 32(1): 37-49. https://doi.org/10.1177/0267323116682804
Daosen Zheng. Don’t completely deny the value of big data to the film industry just because of the failure of the box office forecast. ENTERTAINMENT CAPITAL, 2014.10.23. https://www.tmtpost.com/162475.html
Galen Panger. Reassessing the Facebook experiment: critical thinking about the validity of Big Data research. Information, Communication & Society, 2015. 19(8): 1108-1126. https://doi.org/10.1080/1369118X.2015.1093525
Gitelman, L. (Ed.). “Raw Data” Is an Oxymoron. Cambridge: MIT Press, 2013. Introduction chapter.
IBM. Big Data Analytics. https://www.ibm.com/analytics/hadoop/big-data-analytics
Liu, Li, et al. Rethinking big data: A Review on the data quality and usage issues. ISPRS Journal of Photogrammetry and Remote Sensing, 2016. 134-142. https://doi.org/10.1016/j.isprsjprs.2015.11.006
Peng T-Q, Zhang L, Zhong Z-J, et al. Mapping the landscape of Internet Studies: Text mining of social science journal articles 2000–2009. New Media & Society, 2013. 15(5): 644–664. https://doi.org/10.1177/1461444812462846
Shalini Sharma, Naresh Kumar and Kuldeep Singh Kaswan. Big data reliability: A critical Review. Journal of Intelligent & Fuzzy Systems, 2021. 40: 5501-5516. https://doi.org/10.3233/JIFS-202503
Stefan Strauß. Datafication and the Seductive Power of Uncertainty—A Critical Exploration of Big Data Enthusiasm. Information, 2015. 6(4): 836-847. https://doi.org/10.3390/info6040836
Wang, Wang, et al. The Big Data Applications in Film Industry Chain. International Journal of Database Theory and Application, 2016. 9(12): 1-8. https://doi.org/10.14257/ijdta.2016.9.12.01