How to use data in the social sciences?

With the spread of computers, smartphones, and other electronic technologies, the volume of network data has grown dramatically, prompting social scientists to identify new problems and to apply new methods to old ones. Economists, political scientists, and sociologists can use data from online sources such as Google, Twitter, Facebook, and blogs to study issues such as public opinion, information flow, and disease transmission. The use of online big data has three fundamental advantages in social research (Johnson and Smith, 2017). First, collecting network data takes less time and money than collecting traditional questionnaire data. Traditional surveys require time and money to train interviewers and to question the sampled population, and the big data approach avoids these costs. Second, big data is timely. Because it is constantly updated, it makes it possible to investigate emergencies as soon as they occur. Third, big data is comparatively complete. Survey researchers regularly face problems such as low response rates and item non-response, whereas the amount of data each person contributes online has increased year by year.

Despite these advantages, social scientists need to consider the limitations of network big data when using it. This article discusses three challenges of using network data in the social sciences: lack of representativeness, measurement error, and Type I error. It then lists several solutions to the lack of representativeness: calibrating network data against real-world statistics, estimating trends with difference-in-differences models, weighting network data, and treating network data as panel data.



1. Lack of representativeness

Many scholars have pointed out that network data suffers from selection bias and that researchers cannot control how representative the data is. Because older and poorer people have less access to the Internet, online data tends to exclude them. For example, Scarborough (2018) collected tweets containing feminist keywords posted on Father's Day 2017 and Mother's Day. Using Naive Bayes sentiment analysis of these tweets, the study derived the attitude of tweets toward feminism in different regions. To assess how representative the Twitter data was, the study tested the correlation between the Twitter sentiment index and the gender attitude index in the General Social Survey. It also examined whether the Twitter sentiment index could be predicted by the gender attitudes of people of different races, genders, and education levels. The results show that the Twitter sentiment index for feminism is highly correlated with the gender attitude index in the General Social Survey. However, the strength of this correlation differs across races and education levels: non-white and less-educated populations use Twitter less, and for these groups the correlation between the Twitter sentiment index and the gender attitude index is also lower. These results indicate that although Twitter is an important way to understand public opinion, it is not representative of the overall population.

2. Measurement error

In addition to the representativeness problem, researchers have found measurement error in network data. A classic case is the failure of Google Flu Trends. Lazer et al. (2014) found that estimates based on the frequency of flu-related Google searches did not track actual flu prevalence and substantially overestimated it. This means that Google search volume may not be a reliable measure. Measurement error also appears on social media. For example, through Facebook's “Expats (Mexico)” classification, researchers can study Mexican immigrants living in the United States who are 18 years of age (Zagheni et al. 2017). Facebook does not clearly define “expats”; the classification is generally based on two factors: the city of residence and hometown that individuals enter in their profile, and the structure of their friends' social networks. The authors point out that such a definition carries potential measurement error: people classified as “expats” were not necessarily born abroad, and the personal information that users provide is not necessarily true. Such measurement errors are difficult to eliminate, so models based on this kind of data often need to be recalibrated.

3. Greater risk of Type I error

A Type I error, or false positive, occurs when a statistically significant relationship between two variables arises by chance rather than from a real relationship (Barocas and Selbst 2016). This problem is more likely when researchers add a large number of variables to a model: the more variables are added, the more likely it is that some significant relationship will be found purely by chance. Because big data contains a large number of observations and variables, data-driven research is more prone to Type I errors than traditional theory-driven research (Boyd and Crawford 2012).
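The effect of testing many variables can be made concrete with a short calculation. The sketch below is an illustration under simplifying assumptions, not something taken from the cited papers: it assumes that every test is independent, that no real relationship exists, and that each test uses the same significance level. The function names are hypothetical.

```python
import random

def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent null tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

def simulate_false_positive_rate(num_tests, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo check of the formula above.

    Under the null hypothesis a p-value is uniform on [0, 1], so each test
    is simulated as a single uniform draw compared with alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(m, round(familywise_error_rate(m), 3),
              round(simulate_false_positive_rate(m), 3))
```

Under these assumptions, a model that screens 20 unrelated variables at the 0.05 level has roughly a 64 percent chance of producing at least one spurious significant result, which is why corrections for multiple comparisons are commonly applied in data-driven work.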


1. Calibration against real-world statistics

When network data lacks representativeness, researchers can estimate the true value of the quantity of interest by calibrating against real-world statistics. This approach requires functional assumptions about the relationship between the true values and the data observed online, and about the relationship between Internet penetration and sociodemographic variables. For example, Zagheni and Weber (2012) studied the migration rates of people of different ages by observing the IP addresses from which emails were sent. They built functions based on age and Internet penetration rates in different countries, estimated the error in the observed migrant population, and then calibrated the model against demographic data from European countries. Correcting for this error adjusts the email-based observations toward the actual number of migrants.

However, this method is only applicable to countries and regions with well-developed official statistics. Zagheni and Weber (2012) found that it does not apply to some African countries, which have few Internet users, low Internet penetration, and no comprehensive demographic data.
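The general idea of calibration can be sketched with a much simpler adjustment than the one Zagheni and Weber (2012) use. The example below is not their model: it assumes, purely for illustration, that the bias in each age group can be captured by a single multiplicative factor learned from reference countries where official statistics exist. All function names, country labels, and numbers are hypothetical.

```python
def fit_correction_factors(online_rates, official_rates):
    """Average ratio of official to online rate per age group.

    Both arguments map country -> {age_group: rate}. Only reference
    countries with trustworthy official statistics should be passed in.
    """
    sums, counts = {}, {}
    for country, by_age in online_rates.items():
        for age_group, online in by_age.items():
            if online <= 0:
                continue  # a ratio is undefined without online observations
            ratio = official_rates[country][age_group] / online
            sums[age_group] = sums.get(age_group, 0.0) + ratio
            counts[age_group] = counts.get(age_group, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def calibrate(online_by_age, factors):
    """Apply the fitted factors to online rates from a target country."""
    return {g: rate * factors[g] for g, rate in online_by_age.items()}

# Hypothetical reference data: online rates understate migration,
# more so for the older group with lower Internet penetration.
online = {"A": {"young": 0.02, "old": 0.01},
          "B": {"young": 0.04, "old": 0.02}}
official = {"A": {"young": 0.03, "old": 0.03},
            "B": {"young": 0.06, "old": 0.04}}

factors = fit_correction_factors(online, official)
calibrated = calibrate({"young": 0.10, "old": 0.02}, factors)
```

The sketch also shows why the approach breaks down without good official statistics: with no reference countries comparable to the target, there is nothing from which to estimate the factors.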

2. Difference-in-differences model

Where reliable official statistics are unavailable, researchers can still estimate trends with a difference-in-differences (double difference) model (Zagheni and Weber 2012). If social media users as a whole follow a similar trend, the researcher can compare the change over time in a particular group or region with the change over time among all users, and thereby obtain the relative trend for that group.
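The comparison described above reduces to subtracting one change from another. The following minimal sketch assumes two time points and a single summary value (for example, a rate) for the group and for all users; the function name and the numbers are hypothetical.

```python
def difference_in_differences(group_before, group_after,
                              overall_before, overall_after):
    """Change in the group of interest minus the change among all users."""
    group_change = group_after - group_before
    overall_change = overall_after - overall_before
    return group_change - overall_change

# Hypothetical rates: the group rises from 0.10 to 0.16 while all users
# rise from 0.08 to 0.10, so the group's relative change is about 0.04.
relative_trend = difference_in_differences(0.10, 0.16, 0.08, 0.10)
```

Because any bias shared by the group and the overall user base cancels in the subtraction, the result describes a relative trend rather than an absolute level.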

3. Weighting network data

Another way to reduce the error caused by the lack of representativeness is to weight the network data. Individual-level or aggregate samples of social media users can be used to calculate the weights (Diaz et al. 2015). Weighting the data makes it easier to compare different user groups. For example, women post fewer tweets than men overall but are more inclined to tweet about political issues. Weighting the data from female users can therefore yield more representative results. As mentioned above, non-white and less-educated populations are underrepresented on Twitter. Giving these groups larger weights increases their influence on the estimates and improves representativeness to some extent.
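One common way to weight is post-stratification, in which each group receives a weight equal to its share of the population divided by its share of the sample. The sketch below illustrates that general technique and is not necessarily the procedure used by Diaz et al. (2015). It assumes the population shares are known from an external source; the function names and numbers are hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Weight per group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}

def weighted_mean(values, groups, weights):
    """Mean of values where each observation carries its group's weight."""
    total_weight = sum(weights[g] for g in groups)
    weighted_sum = sum(v * weights[g] for v, g in zip(values, groups))
    return weighted_sum / total_weight

# Hypothetical sample: 8 tweets from men and 2 from women, although the
# population is assumed to be split evenly.
groups = ["m"] * 8 + ["f"] * 2
scores = [0.2] * 8 + [0.8] * 2

weights = poststratification_weights(groups, {"m": 0.5, "f": 0.5})
unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, groups, weights)
```

In this toy sample the unweighted mean is about 0.32, while the weighted mean is about 0.5, because the two women's observations are scaled up to match their assumed population share.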

4. Treat network data as panel data

Finally, rather than treating network data as a representative sample of the overall population, we can treat it as panel data and observe changes in individuals or groups over time. For example, for tweets about the election posted on a given day during the campaign, Diaz et al. (2015) measured the time elapsed since the same user had last discussed the election, and for most users this gap was around a week. On the day of a campaign debate, however, the researchers found that the gap increased significantly, which means that many users who were not usually keen on discussing the campaign joined the discussion on key dates. Such panel data can also be used to study changes in behavior and attitudes before and after certain events, especially events that affect particular groups. Researchers can select social media users from different groups, observe how they change before and after the event, and identify differences between groups.
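The time-gap measure described above can be computed directly from timestamped tweets. The sketch below is an illustration of the idea, not the code used by Diaz et al. (2015). It assumes each record is a pair of a user identifier and a timestamp for a tweet already classified as being about the election; the function names and example dates are hypothetical.

```python
from datetime import datetime, date
from statistics import median

def gaps_since_previous_tweet(tweets):
    """Return (user, timestamp, gap_in_days) for every tweet after a user's first."""
    last_seen = {}
    gaps = []
    for user, ts in sorted(tweets, key=lambda t: t[1]):
        if user in last_seen:
            gap_days = (ts - last_seen[user]).total_seconds() / 86400
            gaps.append((user, ts, gap_days))
        last_seen[user] = ts
    return gaps

def median_gap_on_day(gaps, day):
    """Median gap among tweets posted on the given calendar day, or None."""
    day_gaps = [g for _, ts, g in gaps if ts.date() == day]
    return median(day_gaps) if day_gaps else None

# Hypothetical data: user "a" tweets weekly, user "b" returns after 30 days.
tweets = [
    ("a", datetime(2020, 10, 1, 12, 0)),
    ("a", datetime(2020, 10, 8, 12, 0)),
    ("b", datetime(2020, 9, 8, 12, 0)),
    ("b", datetime(2020, 10, 8, 12, 0)),
]
gaps = gaps_since_previous_tweet(tweets)
debate_day_median = median_gap_on_day(gaps, date(2020, 10, 8))
```

A jump in the daily median on a particular date suggests that infrequent participants have rejoined the discussion, which is the pattern the text describes for debate days.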


Conclusion

This article has described several challenges in using network data in the social sciences: lack of representativeness, measurement error, and Type I error. It has also listed several solutions to the lack of representativeness: calibrating network data against real-world statistics, estimating trends with a difference-in-differences model, weighting network data, and treating network data as panel data. Although network data offers more research resources for the social sciences, researchers should take the particular characteristics of online data into account, identify its shortcomings, and minimize the gap between network data and real-world data.