Sociological Methods & Research: Latest Articles, Page 4

From Codebooks to Promptbooks: Extracting Information from Text with Generative Large Language Models

IF 6.3 · Zone 2 (Sociology) · Q1 SOCIAL SCIENCES, MATHEMATICAL METHODS

Sociological Methods & Research

Pub Date : 2025-06-25 DOI: 10.1177/00491241251336794

Oscar Stuhler, Cat Dang Ton, Etienne Ollion

Generative AI (GenAI) is quickly becoming a valuable tool for sociological research. Already, sociologists employ GenAI for tasks like classifying text and simulating human agents. We point to another major use case: the extraction of structured information from unstructured text. Information Extraction (IE) is an established branch of Natural Language Processing, but leveraging the affordances of this paradigm has thus far required familiarity with specialized models. GenAI changes this by allowing researchers to define their own IE tasks and execute them via targeted prompts. This article explores the potential of open-source large language models for IE by extracting and encoding biographical information (e.g., age, occupation, origin) from a corpus of newspaper obituaries. As we proceed, we discuss how sociologists can develop and evaluate prompt architectures for such tasks, turning codebooks into “promptbooks.” We also evaluate models of different sizes and prompting techniques. Our analysis showcases the potential of GenAI as a flexible and accessible tool for IE while also underscoring risks like non-random error patterns that can bias downstream analyses.
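The idea of turning a codebook into a "promptbook" can be sketched roughly as follows. This is a minimal illustration, not the authors' prompt architecture: the field rules in PROMPTBOOK, the prompt wording, and the helper names build_prompt and parse_response are all hypothetical, and no model is actually called; a made-up reply stands in for model output.

```python
import json

# Hypothetical codebook entries rewritten as prompt instructions.
PROMPTBOOK = {
    "age": "Age at death in years as an integer, or null if not stated.",
    "occupation": "Main occupation as a short phrase, or null if not stated.",
    "origin": "Place of birth or origin, or null if not stated.",
}

def build_prompt(text, promptbook=PROMPTBOOK):
    """Assemble one extraction prompt from the promptbook and an obituary."""
    fields = "\n".join(f'- "{name}": {rule}' for name, rule in promptbook.items())
    return (
        "Extract the following fields from the obituary below.\n"
        f"{fields}\n"
        "Answer with a single JSON object and nothing else.\n\n"
        f"Obituary:\n{text}"
    )

def parse_response(raw, promptbook=PROMPTBOOK):
    """Parse the reply; keep only known fields and fill gaps with None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {name: None for name in promptbook}
    if not isinstance(data, dict):
        return {name: None for name in promptbook}
    return {name: data.get(name) for name in promptbook}

# A made-up reply standing in for real model output.
reply = '{"age": 87, "occupation": "school teacher", "hometown": "Lyon"}'
print(parse_response(reply))
```

Dropping unknown keys and returning None for missing or unparseable fields keeps the output schema fixed, which makes it easier to compare extractions against hand-coded data and to look for the non-random error patterns the abstract warns about.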

Citations: 0

Geographic Variation in Multigenerational Mobility

Pub Date : 2025-06-20 DOI: 10.1177/00491241251341196

Martin Nybom, Jan Stuhler

Using complete-count register data spanning three generations, we document spatial patterns in inter- and multi-generational mobility in Sweden. Across municipalities, grandfather–child correlations in education or earnings tend to be larger than the square of the parent–child correlations, suggesting that the latter understate status transmission in the long run. Yet, conventional parent–child correlations capture regional differences in long-run transmission and therefore remain useful for comparative purposes. We further find that the within-country association between mobility and income inequality (the “Great Gatsby Curve”) is at least as strong in the multi- as in the inter-generational case. Interpreting those patterns through the lens of a latent factor model, we find that regional differences in mobility primarily reflect variation in the transmission of latent advantages, rather than in how those advantages translate into observed outcomes.
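Why a grandfather-child correlation above the squared parent-child correlation points to understated persistence can be illustrated with a simple latent factor setup. The sketch below assumes, for illustration only, that observed status is a noisy measure of a latent advantage (loading rho), that the latent advantage passes between generations with persistence lam, and that measurement noise is uncorrelated across generations. Under those assumptions the parent-child correlation is rho^2 * lam and the grandparent-child correlation is rho^2 * lam^2. The abstract does not spell out its model, and the input numbers are made up.

```python
def latent_factor_params(r_parent_child, r_grandparent_child):
    """Back out latent persistence and the squared loading from two correlations.

    Assumes r_parent_child = rho^2 * lam and r_grandparent_child = rho^2 * lam^2.
    """
    lam = r_grandparent_child / r_parent_child
    rho_sq = r_parent_child ** 2 / r_grandparent_child
    return lam, rho_sq

# Made-up correlations for one municipality.
r1, r2 = 0.30, 0.15
lam, rho_sq = latent_factor_params(r1, r2)
print("squared parent-child correlation:", r1 ** 2)
print("observed grandparent-child correlation:", r2)
print("implied latent persistence:", lam)
print("implied squared loading:", rho_sq)
```

In this toy case the implied latent persistence exceeds the observed parent-child correlation, which is the sense in which the latter can understate long-run transmission.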

Citations: 0

Absolute and Relative Mobility: Two Frameworks for Connecting Intergenerational Mobility in Absolute and Relative Terms

Pub Date : 2025-06-20 DOI: 10.1177/00491241251347982

Deirdre Bloome

Researchers concerned about intergenerational inequalities study absolute and relative mobility (e.g., whether people’s adult incomes exceed their parents’ incomes in dollars or in ranks). Absolute and relative mobility are connected by definition. Yet they are not equivalent, and they often diverge. To illuminate why, when, and for whom such divergence occurs (and why, when, and for whom convergence is possible), this article provides two frameworks for connecting absolute and relative mobility. One framework is formal and one is typological. Both frameworks center micro-level socioeconomic experiences across generations. Illustrative analyses employ these frameworks using National Longitudinal Survey of Youth data. Results suggest that divergent experiences, like upward absolute mobility despite downward relative mobility, may be more common among more advantaged social groups. Future researchers could use the two frameworks introduced here to further advance our understanding of how intergenerational inequalities evolve differently in absolute and relative terms.
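A toy example may help show how absolute and relative mobility can diverge for the same family. The sketch below is not the article's formal or typological framework; it simply compares each child's income with the parent's in dollars and in within-generation rank. The incomes are made up and the function names are hypothetical.

```python
def ranks(values):
    """Rank positions (0 = lowest) within one generation; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result

def direction(child, parent):
    """Label a child-versus-parent comparison as up, down, or same."""
    if child > parent:
        return "up"
    if child < parent:
        return "down"
    return "same"

def classify(parent_income, child_income):
    """Pair each family's absolute (dollar) and relative (rank) direction."""
    parent_rank, child_rank = ranks(parent_income), ranks(child_income)
    return [
        (direction(c, p), direction(cr, pr))
        for p, c, pr, cr in zip(parent_income, child_income, parent_rank, child_rank)
    ]

# Made-up real incomes (thousands of dollars) for four families.
parents = [20, 40, 60, 80]
children = [50, 45, 55, 120]
for family, pair in enumerate(classify(parents, children)):
    print(family, pair)
```

Family 1 earns more than its parent in dollars yet falls one rank, because other children's incomes grew faster; this is the kind of divergent experience the abstract describes.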

Citations: 0

Is There a Mobility Effect? On Methodological Issues in the Mobility Contrast Model

Pub Date : 2025-06-19 DOI: 10.1177/00491241251347983

Xi Song, Xiang Zhou

Social mobility scholars have long been interested in estimating the effect of intergenerational mobility, typically measured by differences in the socioeconomic status between parents and offspring, on later-life outcomes of offspring. In a 2022 article “Heterogeneous Effects of Intergenerational Social Mobility: An Improved Method and New Evidence,” Luo proposes a new approach called the mobility contrast model (MCM) to define and estimate mobility effects. We argue that the MCM is inherently flawed due to its reliance on the coding scheme used for the categorical variables of social origin and destination. Specifically, when different coding schemes are applied, the estimands defined in the MCM bear distinct meanings, involve different but equally arbitrary constraints, and sometimes yield contradictory results. Moreover, regardless of the coding scheme, these estimands do not adequately capture the sociological concept of a mobility effect. To illustrate this, we reanalyze the Occupational Changes in a Generation Study data used in Luo’s study, highlighting the inconsistency of results when dummy coding versus effect coding schemes are used.
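The general point that a coefficient's meaning depends on the coding scheme can be seen in a much simpler setting than the MCM. The sketch below is not the mobility contrast model; it uses a one-way design with made-up group means, where the coefficient on a category equals its mean minus the reference mean under dummy coding, and its mean minus the unweighted average of group means under effect coding. Function names are hypothetical.

```python
from statistics import mean

def dummy_coefficients(group_means, reference):
    """Dummy coding: each coefficient is a contrast with the reference group."""
    ref = group_means[reference]
    return {g: m - ref for g, m in group_means.items() if g != reference}

def effect_coefficients(group_means):
    """Effect coding: each coefficient is a deviation from the unweighted
    mean of group means (the omitted category is implied by the others)."""
    grand = mean(group_means.values())
    return {g: m - grand for g, m in group_means.items()}

# Made-up outcome means for three origin categories.
group_means = {"A": 1.0, "B": 2.0, "C": 6.0}
dummy = dummy_coefficients(group_means, reference="A")
effect = effect_coefficients(group_means)
print("dummy coding, B:", dummy["B"])
print("effect coding, B:", effect["B"])
```

Category B gets a positive coefficient under dummy coding and a negative one under effect coding, even though the data are identical; the two numbers answer different questions.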

Citations: 0

Social Rigidity Across and Within Generations: A Predictive Approach

Pub Date : 2025-06-19 DOI: 10.1177/00491241251347984

Haowen Zheng, Siwei Cheng

How well can individuals’ parental background and previous life experiences predict their mid-life socioeconomic status (SES) attainment? This question is central to stratification research, as a strong power of earlier experiences in predicting later-life outcomes signals substantial intra- or intergenerational status persistence, or put simply, social rigidity. Running machine learning models on panel data to predict outcomes that include hourly wage, total income, family income, and occupational status, we find that a large number (around 4,000) of predictors commonly used in the stratification literature improves the prediction of one’s life chances in middle to late adulthood by about 10 percent to 50 percent, compared with a null model that uses a simple mean of the outcome variable. The level of predictability depends on the specific outcome being analyzed, with labor market indicators like wages and occupational prestige being more predictable than broader socioeconomic measures such as overall personal and family income. Grouping a comprehensive list of predictors into four unique sets that cover family background, childhood and adolescence development, early labor market experiences, and early adulthood family formation, we find that including income, employment status, and occupational characteristics at early career significantly improves models’ prediction accuracy for mid-life SES attainment. We also illustrate the application of the predictive models to examine heterogeneity in predictability by race and gender and identify important variables through this data-driven exercise.
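One way to express "improvement over a null model that predicts the mean" is the proportional reduction in held-out squared error. The sketch below assumes mean squared error as the metric, which the abstract does not specify, and uses made-up numbers; relative_improvement is a hypothetical helper name.

```python
def relative_improvement(y_true, y_pred, train_mean):
    """Share of the null model's squared error removed by the predictions.

    The null model predicts train_mean for every held-out case.
    """
    n = len(y_true)
    mse_model = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    mse_null = sum((y - train_mean) ** 2 for y in y_true) / n
    return 1 - mse_model / mse_null

# Made-up held-out outcomes, model predictions, and training-set mean.
y_true = [10, 20, 30, 40]
y_pred = [20, 20, 30, 30]
gain = relative_improvement(y_true, y_pred, train_mean=25)
print(gain)
```

A value of 0 means the predictors add nothing beyond the mean, and a value of 1 means perfect prediction, so higher values indicate more predictability and, in the abstract's terms, more social rigidity.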

Citations: 0

The Causal Effect of Parent Occupation on Child Occupation: A Multivalued Treatment with Positivity Constraints

Pub Date : 2025-06-02 DOI: 10.1177/00491241251338412

Ian Lundberg, Daniel Molitor, Jennie E. Brand

To what degree does parent occupation cause a child’s occupational attainment? We articulate this causal question in the potential outcomes framework. Empirically, we show that adjustment for only two confounding variables substantially reduces the estimated association between parent and child occupation in a U.S. cohort. Methodologically, we highlight complications that arise when the treatment variable (parent occupation) can take many categorical values. A central methodological hurdle is positivity: some occupations (e.g., lawyer) are simply never held by some parents (e.g., those who did not complete college). We show how to overcome this hurdle by reporting summaries within subgroups that focus attention on the causal quantities that can be credibly estimated. Future research should build on the longstanding tradition of descriptive mobility research to answer causal questions.
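The positivity hurdle can be made concrete with a simple check: within each covariate stratum, which treatment values are never observed? The sketch below is a minimal illustration with made-up records, not the authors' procedure; positivity_gaps is a hypothetical helper name.

```python
from collections import defaultdict

def positivity_gaps(records):
    """Return, per covariate stratum, the treatment values never observed there.

    records is an iterable of (stratum, treatment) pairs.
    """
    seen = defaultdict(set)
    all_treatments = set()
    for stratum, treatment in records:
        seen[stratum].add(treatment)
        all_treatments.add(treatment)
    return {stratum: all_treatments - observed for stratum, observed in seen.items()}

# Made-up (parent education, parent occupation) pairs.
records = [
    ("college", "lawyer"),
    ("college", "teacher"),
    ("no_college", "teacher"),
    ("no_college", "driver"),
]
gaps = positivity_gaps(records)
print(gaps)
```

Here only "teacher" is observed in both strata, so a comparison involving "lawyer" or "driver" is supported in just one stratum. Restricting summaries to subgroups where the compared treatments are all observed is one way to focus on quantities that can be credibly estimated.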

Citations: 0

Simulating Subjects: The Promise and Peril of Artificial Intelligence Stand-Ins for Social Agents and Interactions

Pub Date : 2025-06-02 DOI: 10.1177/00491241251337316

Austin C. Kozlowski, James Evans

Large language models (LLMs), through their exposure to massive collections of online text, learn to reproduce the perspectives and linguistic styles of diverse social and cultural groups. This capability suggests a powerful social scientific application—the simulation of empirically realistic, culturally situated human subjects. Synthesizing recent research in artificial intelligence and computational social science, we outline a methodological foundation for simulating human subjects and their social interactions. We then identify six characteristics of current models that are likely to impair the realistic simulation of human subjects: bias, uniformity, atemporality, disembodiment, linguistic cultures, and alien intelligence. For each of these areas, we discuss promising approaches for overcoming their associated shortcomings. Given the rate of change of these models, we advocate for an ongoing methodological program for the simulation of human subjects that keeps pace with rapid technical progress, and caution that validation against human subjects data remains essential to ensure simulation accuracy.

Citations: 0

Quantifying Narrative Similarity Across Languages

Pub Date : 2025-06-02 DOI: 10.1177/00491241251340080

Hannah Waight, Solomon Messing, Anton Shirikov, Margaret E. Roberts, Jonathan Nagler, Jason Greenfield, Megan A. Brown, Kevin Aslett, Joshua A. Tucker

How can one understand the spread of ideas across text data? This is a key measurement problem in sociological inquiry, from the study of how interest groups shape media discourse, to the spread of policy across institutions, to the diffusion of organizational structures and institutions themselves. To study how ideas and narratives diffuse across text, we must first develop a method to identify whether texts share the same information and narratives, rather than the same broad themes or exact features. We propose a novel approach to measure this quantity of interest, which we call “narrative similarity,” by using large language models to distill texts to their core ideas and then compare the similarity of claims rather than of words, phrases, or sentences. The result is an estimand much closer to narrative similarity than what is possible with past relevant alternatives, including exact text reuse, which returns lexically similar documents; topic modeling, which returns topically similar documents; or an array of alternative approaches. We devise an approach to providing out-of-sample measures of performance (precision, recall, F1) and show that our approach outperforms relevant alternatives by a large margin. We apply our approach to an important case study: The spread of Russian claims about the development of a Ukrainian bioweapons program in U.S. mainstream and fringe news websites. While we focus on news in this application, our approach can be applied more broadly to the study of propaganda, misinformation, diffusion of policy and cultural objects, among other topics.
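For readers less familiar with the performance measures named above, the sketch below computes precision, recall, and F1 for binary "same narrative" judgments on held-out document pairs. The labels are made up, and the function is a generic illustration rather than the authors' evaluation code.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for binary labels (1 = same narrative)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up hand labels and method output for five held-out document pairs.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

Precision penalizes pairs wrongly flagged as sharing a narrative, recall penalizes shared narratives that were missed, and F1 is their harmonic mean.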

Citations: 0

An Optimal Stratification Method for Addressing Nonresponse Bias in Bayesian Adaptive Survey Design

Pub Date : 2025-06-02 DOI: 10.1177/00491241251345463

Yongchao Ma, Nino Mushkudiani, Barry Schouten

In a probability sampling survey, adaptive data collection strategies may be used to obtain a response set that minimizes nonresponse bias within budget constraints. Previous research has stratified the target population into subgroups defined by categories of auxiliary variables observed for the entire population, and tailored strategies to obtain similar response rates across subgroups. However, if the auxiliary variables are weakly correlated with the target survey variables, optimizing data collection for these subgroups may not reduce nonresponse bias and may actually increase the variance of survey estimates. In this paper, we propose a stratification method to identify subgroups by: (1) predicting values of target survey variables from auxiliary variables, and (2) forming subgroups with different response propensities based on the predicted values of target survey variables. By tailoring different data collection strategies to these subgroups, we can obtain a response set with less variation in response propensities across subgroups that are directly relevant to the target survey variables. Given this rationale, we also propose to measure nonresponse bias by the coefficient of variation of response propensities estimated from the predicted target survey variables. A case study using the Dutch Health Survey shows that the proposed stratification method generally produces less variation in response propensities with respect to the predicted target survey variables compared to traditional methods, thereby leading to a response set that better resembles the population.
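The coefficient of variation of response propensities mentioned above can be sketched for a stratified population as the standard deviation of stratum propensities divided by their mean, both weighted by population shares. This is a simplified illustration with made-up numbers; it takes the strata and propensities as given and does not reproduce the authors' Bayesian estimation or their stratification step.

```python
from math import sqrt

def propensity_cv(shares, propensities):
    """Coefficient of variation of response propensities across strata.

    shares are population shares of the strata and are assumed to sum to 1.
    """
    mean_p = sum(w * p for w, p in zip(shares, propensities))
    var_p = sum(w * (p - mean_p) ** 2 for w, p in zip(shares, propensities))
    return sqrt(var_p) / mean_p

# Made-up strata: equal population shares, unequal response propensities.
uneven = propensity_cv([0.5, 0.5], [0.4, 0.6])
even = propensity_cv([0.5, 0.5], [0.5, 0.5])
print(uneven, even)
```

A smaller value means response propensities are more alike across strata; when the strata are formed from predicted target variables, that is the sense in which the response set better resembles the population.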

Citations: 0

Generative Multimodal Models for Social Science: An Application with Satellite and Streetscape Imagery

Pub Date : 2025-05-27 DOI: 10.1177/00491241251339673

Tina Law, Elizabeth Roberto

Although there is growing social science research examining how generative AI models can be effectively and systematically applied to text-based tasks, whether and how these models can be used to analyze images remain open questions. In this article, we introduce a framework for analyzing images with generative multimodal models, which consists of three core tasks: curation, discovery, and measurement and inference. We demonstrate this framework with an empirical application that uses OpenAI's GPT-4o model to analyze satellite and streetscape images (n = 1,101) to identify built environment features that contribute to contemporary residential segregation in U.S. cities. We find that when GPT-4o is provided with well-defined image labels, the model labels images with high validity compared to expert labels. We conclude with thoughts for other use cases and discuss how social scientists can work collaboratively to ensure that image analysis with generative multimodal models is rigorous, reproducible, ethical, and sustainable.

Citations: 0