Influencing AI Training for Better Representation
The Problem:
AI models like ChatGPT, Gemini, and others reflect systemic biases found in the data they are trained on. To create a more accurate representation of women and minorities, we need to actively take steps to introduce change. One approach is to flood the datasets that AI is trained on with more balanced, high-quality content over time. While it is not the only approach needed, it is something we can do today and something that most people can participate in.
TL;DR Version
Inspired by an article on LinkedIn by Sam Bettis, I started looking deeper into what might be influencing this, with the goal of finding a practical way that we can introduce real and effective change.
Here I provide an abridged action plan, with expected outcomes, to help increase the representation of women in the sources that have a high probability of influencing AI. While this focuses on women, it can be applied to any demographic.
All information has been provided by ChatGPT as of 3/8/25.
Key Actions to Influence AI Training
1. Expand and Improve Key Data Sources
- Wikipedia → Add & expand 1M+ leadership biographies over 3 years.
- Books & Research → Publish & promote 7,500+ leadership books and studies.
- News & Journalism → Push media for more diverse leadership coverage.
- LinkedIn & Medium/Substack → 5M+ leadership articles & posts to influence professional AI training data.
2. Leverage Social Media for AI Visibility
- X (Twitter), LinkedIn, Reddit → Drive 2.5B+ high-engagement posts about diverse leadership.
- Use viral hashtags and influencers to push leadership narratives AI models will pick up on.
3. Advocate for AI Bias Audits & Data Transparency
- Push OpenAI, Google, Meta to disclose & diversify training data.
- Encourage new AI datasets with balanced representation.
- Get AI audits into corporate policies.
3-Year Action Plan for Public Impact
Year 1: Awareness & Content Creation
✅ Launch nationwide campaign to educate the public on AI bias.
✅ Create 100,000+ Wikipedia pages on diverse leaders.
✅ Publish 1M+ LinkedIn & Medium articles on leadership.
✅ Flood social media with engagement on leadership topics.
Year 2: Acceleration & Scaling
✅ Expand to 500,000+ Wikipedia updates.
✅ Drive 1B+ social media engagements to amplify diverse leadership.
✅ Demand AI firms integrate transparency & bias reduction in training data.
Year 3: Institutionalization & Systemic Change
✅ Launch AI bias education in schools & workplaces.
✅ Establish AI fairness audits & diversity reporting standards.
✅ Create alternative AI training datasets to directly counterbalance bias.
The Expected Impact
- Major AI models retrained on more diverse leadership representation.
- AI-generated leadership examples become more balanced & inclusive.
- New industry standards for AI fairness & transparency.
- Sustained long-term change in how AI understands leadership, power, and influence.
This public-driven strategy should help reduce the systemic biases found in the data AI is trained on, helping the models evolve and the data itself become higher quality, working toward a fairer representation of women's leadership and identity in AI.
Continue reading for the full ChatGPT dialogue.
Extended Version – Full Dialogue
Here you can see the full dialogue that I had with ChatGPT. I felt this provides the best context, so I copied the entire chat and pasted it here. All information has been provided by ChatGPT as of 3/8/25. All user prompts are identified in italicized quotation blocks.
I have a theoretical question about a real-world scenario. As research, as well as personal experimentation, has shown, the LLMs used by AI tools such as ChatGPT, Gemini, etc. are accurate in that they reflect the data they have learned from, yet there is systemic bias in the systems that created the massive amounts of data the models have been trained on. While we do not want to tamper with the data results, which would render the data unreliable, I want to know how we as a society can influence the source data that the LLMs continue to learn from, to such an extent that the existing systemic bias begins to be diluted – resulting in more equitable and true data for these LLMs to be trained on. What immediate actions can we as a society take to make this change?
🤖 ChatGPT:
This is an important and complex question, one that gets to the heart of AI ethics, data integrity, and systemic bias. Since LLMs are reflections of the data they are trained on, tackling bias in AI requires tackling bias at the source – the human systems, institutions, and datasets that shape the information ecosystem.
To influence the training data in a way that dilutes systemic bias while maintaining reliability, society can take the following immediate actions:
1. Diversify and Improve Open Data Sources
- Promote and fund diverse data collection efforts: Governments, research institutions, and private entities should support open datasets that better represent marginalized voices, different perspectives, and global viewpoints.
- Encourage multilingual and multicultural content: Much of the web is dominated by English and Western-centric narratives. Investing in diverse language representation and non-Western knowledge sources will make AI more globally representative.
- Archive and preserve indigenous, minority, and alternative knowledge: AI currently favors digitized, well-indexed knowledge, which skews towards dominant cultures. Digital archives of oral traditions, historical perspectives, and non-traditional knowledge should be expanded.
2. Change How Information is Created and Labeled
- Support ethical journalism and diverse media: Since much of the internet is shaped by journalism and social media, supporting independent, unbiased reporting and community-driven media can improve the quality of the training data over time.
- Advocate for algorithmic transparency in content platforms: Social media and search engines prioritize engagement-driven content, which often amplifies bias and sensationalism. Pushing for transparency in content ranking can create a more balanced information ecosystem.
- Encourage AI-assisted fact-checking: Large publishers, governments, and even individuals should integrate AI into fact-checking efforts to identify bias in real-time and prevent its propagation.
3. Push for Regulatory and Industry Standards
- Establish global AI data quality standards: Governments and international organizations should define quality criteria for AI training datasets, much like we have ISO standards for industrial processes.
- Mandate bias audits for AI models: Companies developing AI should be required to conduct bias impact assessments on their training datasets and report how they mitigate systemic bias.
- Promote public oversight of training data: Encourage AI companies to make aspects of their training data publicly available for academic and regulatory review.
4. Incentivize Ethical AI Development
- Fund AI research that prioritizes fairness: Grants, competitions, and funding opportunities should encourage developers to focus on bias mitigation.
- Encourage open-source AI initiatives: Proprietary AI models are trained on undisclosed datasets, which makes it difficult to assess bias. Open-source AI projects allow for greater scrutiny and community-driven improvements.
- Develop decentralized AI models: Encouraging models trained on federated, community-driven, or blockchain-based datasets could help reduce the influence of a few dominant entities controlling training data.
5. Educate and Empower the Public
- Improve digital literacy around AI bias: The more people understand that AI is shaped by human data, the more they can critically assess AI outputs.
- Encourage diverse participation in AI development: More representation in AI research, data science, and policy-making will ensure that underrepresented groups influence AI design.
- Challenge biased narratives in digital spaces: Individuals and communities should actively counter misinformation and demand fair representation in digital content.
Long-Term Vision
Changing the source data that LLMs learn from is not an overnight process, but the more we reshape the digital and informational landscape, the more accurate and equitable AI models will become. The key is not just fixing bias after the fact but proactively shaping the data that feeds AI.
If these actions are taken seriously, we could see a generational shift in how AI understands and represents the world.
So, in short: flood the internet with more diverse content, sanity-check taxonomy, establish better compliance standards and audits, promote ethical AI, and educate and empower the public. Is that fair to say?
🤖 ChatGPT:
Yes, that's a perfect summary! In short:
✅ Flood the internet with diverse content – Actively create and promote more inclusive, representative, and globally diverse data sources.
✅ Sanity check taxonomy – Ensure classification systems and metadata are not reinforcing bias by critically assessing how data is labeled and categorized.
✅ Establish better compliance standards and audits – Push for regulatory oversight, transparency, and mandatory bias audits for AI training datasets.
✅ Promote ethical AI – Support initiatives that prioritize fairness, open-source development, and decentralized AI models.
✅ Educate and empower the public – Improve digital literacy and ensure diverse voices shape AI policy and development.
If we execute these steps effectively, we can shift AI from reflecting systemic bias to actively reducing it.
Let me ask you this: if I wanted to take one of the actions, let's say "flooding the Internet with more inclusive content", from a statistical POV, what kind of effort would it take to reduce the amplified bias we see toward the existing dominant demographic traits found in current data sets, so that we increase female representation by 20% in the next 5 years? Follow that with a few real-world examples.
🤖 ChatGPT:
To increase female representation by 20% in AI training datasets over the next 5 years, we need to consider the scale of existing data, the rate at which new content is generated, and the mechanisms by which AI companies select training data.
Statistical Considerations & Effort Required
Estimate the Existing Gender Distribution in AI Training Data
- Research suggests that online content disproportionately represents men over women in various domains. On Wikipedia, for instance, only ~20% of biographies are about women.
- News articles, academic research, and social media content also show an underrepresentation of women, often around 30–40% depending on the domain.
Determine How Much New Content is Needed
If women currently represent 30–40% of training data and we aim for a 20% relative increase, then roughly 50–60% of new content would need to be female-centered over a sustained period.
Example Calculation:
- Suppose LLMs are trained on a rolling dataset covering the past 10 years.
- To shift the balance significantly, at least 60% of newly generated content over the next 5 years must prominently feature women, to counteract the historical bias.
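The rolling-dataset reasoning above can be checked with simple blending arithmetic. The sketch below is an illustrative model only, not a description of any real training pipeline: it assumes equal content volume per year and a 35% historical share (the midpoint of the 30–40% range cited earlier).

```python
def blended_share(hist_years, hist_share, new_years, new_share):
    """Overall representation share of a rolling corpus made of
    hist_years of historical content plus new_years of new content,
    assuming equal content volume per year (a simplifying assumption)."""
    total = hist_years + new_years
    return (hist_years * hist_share + new_years * new_share) / total

def required_new_share(hist_years, hist_share, new_years, target):
    """Share that new content must hit for the blend to reach target."""
    total = hist_years + new_years
    return (target * total - hist_years * hist_share) / new_years

# 10-year rolling window: 5 historical years at ~35%, 5 new years.
# A 20% relative lift means targeting 0.35 * 1.2 = 0.42 overall.
breakeven = required_new_share(5, 0.35, 5, 0.42)   # ~0.49
blend_at_60 = blended_share(5, 0.35, 5, 0.60)      # ~0.475
```

Under these assumptions, ~49% female-centered new content is the break-even for a 20% relative lift; the 60% figure quoted above provides headroom against slower-than-expected dataset turnover.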
Content Volume Estimate
- The internet grows at an exponential rate. For instance:
- 500M+ tweets per day (~182B per year)
- 4.4M+ blog posts per day (~1.6B per year)
- Wikipedia adds ~20,000 new articles per month (~240,000 per year)
- If training datasets use a blend of historical & new data, our efforts need to outpace the current rate of biased content generation.
Targeted Content Contribution
- Wikipedia: At least 100,000 new well-sourced Wikipedia articles about women in various fields.
- News Media: If 10M+ articles are published yearly, we need an additional 2M articles per year that focus on women in leadership, STEM, history, and underrepresented areas.
- Social Media & Blogs: If 1.6B blog posts are created per year, a movement adding at least 200M posts per year featuring women positively could shift algorithms.
- Scholarly Research & Corporate Data: More funding and citation efforts toward female-led research could influence AI's academic understanding.
Real-World Examples of Efforts to Shift Representation
Wikipedia's "Women in Red" Initiative
- A movement dedicated to increasing the number of Wikipedia articles about notable women.
- Since its launch in 2015, it has helped grow women's biographies from 15% to ~20%, proving community-driven content can make a statistical impact.
AI Ethics in News Generation (Google's Perspective API, OpenAI's dataset curation)
- AI companies already recognize bias in datasets and have prioritized dataset diversification (e.g., OpenAI adjusting training sources after recognizing gender and racial imbalance in GPT models).
Social Media Activism & AI Bias Studies
- #MeToo Movement showed that large-scale social media discussions can reshape how AI detects and classifies gender-related discourse.
- Efforts like Google's AI Principles & gender-inclusive language guidelines show that companies do adjust based on public pressure.
Corporate Gender Representation in AI Training
- Companies like LinkedIn and Microsoft have committed to bias-free hiring AI tools by increasing women's representation in datasets.
- Funding for women-focused AI datasets (e.g., SheData, Gender Shades research) has led to improved gender balance in machine learning models.
Conclusion: Whatβs Needed?
To increase female representation by 20% in AI training datasets over 5 years, we need to:
✅ Prioritize women-focused content at scale (Wikipedia, news, blogs, social media, research)
✅ Use AI itself to generate balanced datasets (bias-aware training models)
✅ Encourage media & corporate policies to actively represent women in leadership, STEM, history, and research
✅ Pressure AI companies for transparent audits on gender representation in datasets
This requires a mix of grassroots movements, corporate AI ethics enforcement, and open-data contributions.
You mentioned various sources that these LLMs are trained on. Can you tell me the top 10 sources that are used to train LLMs such as ChatGPT?
🤖 ChatGPT:
The exact sources used to train LLMs like ChatGPT (GPT-4), Gemini, and others are not fully disclosed by their respective companies (OpenAI, Google, etc.). However, based on research papers, statements from AI developers, and prior datasets used in similar models, here are the top 10 categories of sources that LLMs are likely trained on:
1. Wikipedia
- A major component in training AI due to its structured, well-moderated, and diverse information.
- It provides factual, broad-coverage content across disciplines, though it has some biases due to contributor demographics.
2. Books (Public and Licensed)
- Many LLMs are trained on public domain books (e.g., from Project Gutenberg) and licensed books.
- This helps models learn complex language structures and long-form reasoning.
3. Scientific Papers & Research Archives
- Sources like arXiv, PubMed, and Semantic Scholar provide models with academic knowledge in STEM, medicine, and humanities.
- Some AI companies have licensing agreements for access to proprietary research databases.
4. News Articles & Journalism
- AI models are trained on news from major publishers (CNN, BBC, The New York Times, The Guardian, etc.).
- Since many of these are behind paywalls, models rely on free-access journalism, press releases, and aggregators like Common Crawl.
5. Web Scrapes (Common Crawl, WebText, C4)
- OpenAI and Google use large-scale web scrapers like Common Crawl, which index billions of web pages.
- The C4 dataset (Colossal Clean Crawled Corpus), developed by Google, filters out low-quality data and is widely used in AI training.
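As an illustration of the kind of cleaning mentioned here, the sketch below applies a few heuristics similar in spirit to C4's published rules (minimum line length, terminal punctuation, boilerplate markers). It is a simplified approximation for intuition, not Google's actual pipeline:

```python
TERMINAL = (".", "!", "?", '"')

def keep_line(line: str) -> bool:
    """Illustrative C4-style line filter (simplified, not Google's code):
    keep only lines that look like natural-language sentences."""
    line = line.strip()
    if len(line.split()) < 5:          # too short to be a sentence
        return False
    if not line.endswith(TERMINAL):    # no terminal punctuation
        return False
    if "lorem ipsum" in line.lower():  # placeholder boilerplate
        return False
    if "{" in line:                    # likely code/markup, not prose
        return False
    return True

page = [
    "Home | About | Contact",
    "Dr. Ada Lovelace is widely credited as the first computer programmer.",
    "function() { return 0; }",
]
clean = [l for l in page if keep_line(l)]  # only the prose sentence survives
```

Filters like these are why well-formed, sentence-level prose (Wikipedia leads, long-form articles) is far more likely to survive into training corpora than navigation menus or fragments.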
6. Code Repositories (GitHub, Stack Overflow)
- Models like GPT-4 and Gemini are trained on programming data from sources like GitHub (public repositories) and Stack Overflow.
- This helps AI generate and understand programming languages.
7. Social Media (Filtered)
- Some AI models ingest public discussions from forums like Reddit, Quora, and sometimes Twitter (X).
- Since raw social media can contain misinformation and bias, AI companies filter content to remove harmful data.
8. Government & Legal Documents
- Publicly available data like laws, court cases, and government reports (e.g., U.S. Supreme Court rulings, EU laws, U.N. reports).
- This helps LLMs understand policy, governance, and legal language.
9. Instructional & How-To Websites
- AI models ingest content from sources like Wikipedia's "how-to" sections, wikiHow, eHow, and educational blogs.
- Helps AI provide step-by-step guidance in responses.
10. Multilingual Data & Translations
- To enhance multilingual capabilities, AI models are trained on parallel translations (e.g., OpenSubtitles, European Parliament speeches).
- This helps models process and generate text in multiple languages.
Important Notes:
- Proprietary & Paywalled Content: AI models generally do not have access to proprietary sources (e.g., NYT full archives, Elsevier's research papers, Amazon Kindle books) unless explicitly licensed.
- Filtering & Curation: Companies apply heavy filtering to remove hate speech, personal data, and misinformation.
- Ongoing Dataset Evolution: Companies update training data periodically, meaning newer models learn from more recent sources than previous versions.
Would you like a deeper dive into any of these sources or their impact on bias?
Let's say I want to target retraining LLMs on how they learn characteristics related to leadership. Out of the top 10 sources named, which 3 sources would you suggest have the highest likelihood of having an impact?
🤖 ChatGPT:
If you want to retrain LLMs on how they learn characteristics related to leadership, the most impactful sources from the top 10 would be:
1. Books (Public and Licensed)
Books, especially biographies, leadership literature, and management research, provide long-form, structured insights into leadership principles.
Examples:
- Biographies: The Autobiography of Martin Luther King Jr., Steve Jobs by Walter Isaacson, Becoming by Michelle Obama.
- Leadership Theories: Good to Great by Jim Collins, Leaders Eat Last by Simon Sinek.
Impact on LLMs:
- Books shape foundational understanding of leadership styles, historical leadership figures, and management philosophies.
- More diverse leadership literature would help reduce biases favoring traditional (often Western, male-dominated) leadership archetypes.
2. News Articles & Journalism
News outlets shape the narrative on modern leaders, political figures, and corporate executives. Leadership representation in media often reinforces societal biases, favoring certain demographics.
Targeted strategy:
- Increase coverage of women, minorities, and emerging leaders in business, politics, and activism.
- Promote reporting on transformational and servant leadership, rather than just authoritarian or crisis-driven leadership.
Impact on LLMs:
- A shift in who is covered and how leadership is portrayed would influence AI's ability to recognize diverse leadership qualities beyond traditional corporate and political figures.
3. Scientific Papers & Research Archives (arXiv, PubMed, Semantic Scholar)
Research databases contain academic studies on leadership psychology, organizational behavior, and management theories.
Key areas to target:
- Diversity in Leadership (e.g., gender-inclusive leadership, emotional intelligence in leaders).
- Leadership Styles (e.g., transformational vs. authoritarian leadership).
- Cross-cultural leadership studies to counteract Western-centric leadership models in AI.
Impact on LLMs:
- Scientific literature provides evidence-based insights, making AI's understanding of leadership more nuanced and research-driven rather than anecdotal.
Final Thoughts:
- Books define historical and theoretical perspectives on leadership.
- News coverage shapes real-time perceptions of leaders.
- Scientific research provides evidence-based corrections to leadership biases.
By curating and weighting these sources differently in AI training, we could reshape how LLMs understand leadership, making them more inclusive, globally relevant, and less prone to reinforcing historical biases.
Wikipedia was mentioned originally; what aspects of trained data does it have a strong impact on?
🤖 ChatGPT:
Wikipedia has a strong impact on several key aspects of trained data in LLMs due to its structured, well-moderated, and diverse knowledge base. While it is not the sole or even the largest source of training data, it plays a critical role in the following areas:
1. General Knowledge & Encyclopedic Understanding
- Wikipedia provides concise, factual summaries across nearly all subjects, making it one of the most reliable sources for training LLMs on basic facts.
- AI models use it to understand historical events, scientific concepts, geography, notable figures, and global affairs.
2. Biographies & Representation of Notable Figures
- Wikipedia influences how AI models understand and prioritize public figures.
- However, gender and racial representation biases exist. For example:
- Women make up only ~20% of Wikipedia biographies.
- Some non-Western historical figures are underrepresented or covered with Western biases.
- If Wikipedia's representation improves (e.g., Women in Red project), AI models will automatically reflect a more balanced view of leadership, history, and notable figures.
3. Taxonomy & Categorization
Wikipedia's structured categories, metadata, and links help LLMs classify and organize information.
Example:
- "Leadership" is categorized under Business, Politics, Psychology, Military, and Philosophy, influencing how AI models contextualize leadership.
- If Wikipedia restructures how certain concepts are categorized, AI models will learn different associations between leadership, power, and influence.
4. Language Structure & Definitions
- Wikipedia's formal, structured writing style helps train LLMs on neutral, clear, and encyclopedic text generation.
- It provides definitions and introductory overviews of complex topics, shaping how AI explains things concisely.
5. Multilingual Knowledge & Translation
- Wikipedia exists in over 300 languages, making it one of the largest parallel text sources for training multilingual models.
- If Wikipedia articles in different languages expand and diversify, AI models will reflect broader global perspectives.
6. Scientific & Historical Accuracy
- Wikipedia is a secondary source, citing academic papers, books, and news.
- While it has errors and biases, its citation system makes it more reliable than random web sources.
- AI models rely on scientific articles from Wikipedia references, improving factual accuracy.
What Wikipedia Does NOT Strongly Influence
While Wikipedia is foundational for factual knowledge, it does NOT play a strong role in:
- Real-time information (e.g., news, current events, social trends).
- Subjective or creative writing (e.g., opinions, narratives, storytelling).
- Deep expertise in specialized fields (it provides overviews, not in-depth technical knowledge).
How Wikipedia Can Be Used to Influence AI Retraining
- Expand and diversify biographies – Improve representation of underrepresented figures.
- Improve categorization – Reshape how AI associates leadership, power, and influence.
- Enhance multilingual coverage – Reduce Western bias in AI training.
- Refine references & citations – Improve factual accuracy in AI outputs.
Since Wikipedia is one of the few training sources where the public can directly contribute, it is one of the easiest ways to influence AI learning in real time.
Roughly how many Wikipedia pages would need to be created in one year to have a significant impact on increasing the representation of women? Alternatively, if page count is not significant and word count is, how much content would need to be created? Also, are there key aspects of a Wikipedia page that carry stronger weights than other data or metadata?
🤖 ChatGPT:
How Many Wikipedia Pages Would Need to Be Created to Increase Women's Representation?
Wikipedia currently has about 6.8 million English articles, with approximately 20% of biographies being about women. The goal is to increase female representation significantly, so we need to estimate how many new pages (or words) would create a meaningful shift.
Option 1: Increasing the Number of Biographies
- Women currently make up ~20% of Wikipedia biographies.
- If we want to increase this to 30% within 5 years, we need to add ~400,000–500,000 new biographies of notable women.
- Annual Target: About 80,000–100,000 new biographies per year to make a noticeable statistical impact.
For reference:
- Women in Red (a Wikipedia project aimed at increasing women's representation) added about 114,000 articles since 2015.
- That means we'd need 10x the current rate of biography creation to see a significant impact in 5 years.
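The scale of this target can be sanity-checked with share arithmetic. The biography count below is an assumption for illustration (English Wikipedia has on the order of 2 million biographies; the exact figure varies), and the model assumes no other biographies are added in the meantime, which is why real-world targets like the one above run higher:

```python
def bios_needed(total_bios, current_share, target_share):
    """New biographies of women needed to lift their share of all
    biographies from current_share to target_share, assuming every
    new biography is about a woman and no others are added meanwhile."""
    women = total_bios * current_share
    # Solve (women + x) / (total_bios + x) = target_share for x.
    return (target_share * total_bios - women) / (1 - target_share)

# Assumption: ~2M biographies on English Wikipedia (illustrative figure).
x = bios_needed(2_000_000, 0.20, 0.30)   # ~286,000 break-even
```

Since men's biographies keep being created too, the practical number sits well above this break-even, consistent with the 400,000–500,000 range quoted.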
Option 2: Increasing Word Count Instead of Page Count
If the total word volume of female representation matters more than page count:
- The average Wikipedia article is ~500–1,500 words.
- Instead of creating new pages, expanding existing women's biographies and leadership articles by 50–100 words each across 1 million articles would add the same volume of content as creating 100,000 new articles.
This means:
- Expanding current women-related pages by 50–100 words each across a large set.
- Adding more citations, context, and leadership details to existing content.
This would still have a major impact because Wikipedia articles are weighted by richness of content – longer, well-cited articles are referenced more by AI models.
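The equivalence claimed here holds at the midpoints of the ranges given (75 words per expansion, 750 words per new article); the exact figures are of course rough:

```python
# Route 1: expand existing articles (midpoint of the 50-100 word range).
articles_expanded = 1_000_000
words_per_expansion = 75

# Route 2: create new articles (midpoint of the 500-1,500 word range).
new_articles = 100_000
words_per_article = 750

expansion_words = articles_expanded * words_per_expansion   # 75M words
new_article_words = new_articles * words_per_article        # 75M words
```

Either route adds roughly 75 million words of women-centered content; the expansion route spreads it across far more pages, which may matter if per-page signals such as links and citations carry weight.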
What Parts of a Wikipedia Page Have Stronger Weight in AI Training?
Not all parts of a Wikipedia article contribute equally to LLM training. Some sections carry more weight than others:
1. Lead Section (Highest Impact)
The first paragraph is the most influential because it is:
- Heavily weighted in AI models (summaries are prioritized).
- Often used for search results and knowledge graphs.
- The most cited part across Wikipedia pages.
Strategy: Ensure the lead section includes leadership achievements, roles, and key contributions of women.
2. Main Body & Citations (High Impact)
AI models weigh well-cited sections more heavily than unverified claims.
Strategy: Add high-quality sources about women's leadership roles and contributions.
3. Infoboxes & Metadata (Moderate Impact)
AI models scan structured data, but infoboxes are not always weighted as heavily as the lead section or citations.
Strategy: Ensure fields like occupation, leadership roles, and achievements are complete.
4. Categories & Links (Indirect Impact)
Wikipedia categories (e.g., "Women in Business", "Female Scientists") help AI models establish connections.
Strategy: Expand cross-links between women's pages and existing leadership-related Wikipedia pages.
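The ranking above can be turned into a rough editing-priority heuristic. The weights below are hypothetical: no AI vendor publishes per-section training weights, so treat this only as a way to reason about where editing effort pays off, not as measured values.

```python
# Hypothetical influence weights per article section, ordered to match
# the ranking above (lead > cited body > infobox > categories).
SECTION_WEIGHTS = {"lead": 1.0, "body_cited": 0.8, "infobox": 0.5, "categories": 0.3}

def edit_score(words_added):
    """Score an edit by words added per section, weighted by the
    (assumed) influence of that section on training data."""
    return sum(SECTION_WEIGHTS[s] * w for s, w in words_added.items())

# Under this model, a 50-word lead expansion outweighs a 150-word
# categories/metadata edit.
lead_edit = edit_score({"lead": 50})              # 50.0
metadata_edit = edit_score({"categories": 150})   # ~45.0
```

The practical takeaway matches the conclusion below: enriching lead sections and cited body text is the highest-leverage edit per word written.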
Conclusion: Whatβs More Effective β Page Count or Word Count?
- If aiming for pure statistical increase, creating ~100,000 new pages per year would shift the representation balance.
- If aiming for LLM impact, expanding existing pages (especially lead sections & citations) may be just as effective, if not more.
- Ideal approach: Do both β increase new biography creation AND enrich existing content.
By strategically creating and enhancing high-quality content, AI training datasets will naturally shift toward a more equitable representation of women in leadership and beyond.
What about social media posts such as tweets (or whatever they are called on X), skeets (Bluesky), and articles and posts on LinkedIn? What impact do they have, if any?
🤖 ChatGPT:
Impact of Social Media Posts on LLM Training
Social media posts – including tweets (X), skeets (Bluesky), and LinkedIn articles – can influence AI training, but their impact varies based on platform, content quality, and how the data is incorporated into training sets.
1. Do LLMs Use Social Media Data?
Yes, but with caveats:
- Many LLMs do not have direct access to proprietary social media content due to data restrictions and paywalls (e.g., X/Twitter limiting API access, LinkedIn being closed off).
- Some publicly available social media content (such as Reddit discussions and Quora answers) is used in datasets like OpenAI's WebText.
- Scraped or aggregated social media data from open sources may still influence AI training indirectly.
2. Which Social Media Sources Have the Strongest Impact?
(a) Twitter (X)
High Impact on Trending Narratives & Public Sentiment
- Twitter is often scraped for large-scale datasets, influencing how AI perceives emerging topics, leadership discourse, and societal trends.
- AI can learn who is considered a leader, how leadership is discussed, and which leadership styles are amplified.
- Challenge: Data is highly unstructured, often containing bias, misinformation, and extreme views.
Tactic: Amplify diverse leadership voices on X to reshape AI perception.
(b) LinkedIn Articles & Posts
Strongest Impact on Professional & Leadership Training
- LinkedIn articles shape AI's understanding of professional leadership, workplace trends, and business discourse.
- AI models likely prioritize well-structured, long-form content over short-form posts.
Challenge: LinkedIn is a closed platform, meaning direct AI training access is limited.
Tactic: Publishing well-researched leadership articles on LinkedIn could still have an impact because news sources often cite them, making them part of the broader web dataset. More female leadership content on LinkedIn can increase representation.
(c) Bluesky, Mastodon, and Other Decentralized Platforms
Lower Direct Impact, Higher Long-Term Potential
- These platforms are still growing, so their data is less likely to be scraped at scale.
- However, if they gain mainstream traction, their datasets could influence future AI models.
Tactic: Early adoption & strategic content creation on these platforms could establish leadership narratives AI will eventually learn from.
3. What Kind of Social Media Content is Most Effective for LLM Influence?
Social media content is not weighted equally in AI training. Highly structured, cited, and widely shared content has the most impact.
High-Impact Social Media Content for AI Training
4. Strategic Actions: How to Influence AI Training Through Social Media
If we want LLMs to better understand and represent diverse leadership, the strategy should be:
✅ Publish high-quality, structured LinkedIn articles on leadership.
✅ Boost Twitter (X) discussions on diverse leadership – especially in viral or well-followed accounts.
✅ Use citation-heavy posts linking to trusted sources (books, research, Wikipedia, news).
✅ Engage with journalists and influencers who shape AI-training datasets.
✅ Push diverse leadership content into forums and news aggregators (e.g., Quora, Reddit, Medium, Substack).
Final Thoughts: Does Social Media Matter for AI Training?
- Yes, but not equally – LinkedIn articles and viral X/Twitter threads hold more weight than random social media posts.
- Twitter (X) and LinkedIn are the most impactful platforms for shaping leadership narratives in AI models.
- Quality > Quantity – Well-structured, widely cited, and widely shared content influences AI more than a high volume of posts.
By strategically shaping leadership discussions online, we can shift AI's understanding of leadership toward more diverse and equitable representations.
Lastly, what about sites like Substack and Medium – do they have any significant impact on representation and LLMs?
🤖 ChatGPT:
Do Substack and Medium Impact LLM Training & Representation?
Yes – platforms like Substack and Medium have a moderate to high impact on how LLMs learn about leadership, representation, and societal narratives. While they don't dominate AI training data like Wikipedia or Common Crawl, they influence AI in specific ways.
1. How Do LLMs Use Substack & Medium Content?
🔹 Both platforms are widely crawled by search engines and web scrapers (unless blocked).
🔹 They contain long-form, structured, and topic-focused content, making them more influential than short social media posts.
🔹 AI models prioritize well-written, cited, and widely shared articles from these platforms.
LLMs learn from Medium/Substack if their content is:
✅ Publicly accessible (not behind a paywall).
✅ Well-ranked on search engines (AI scrapers prioritize SEO-friendly pages).
✅ Cited in other sources (e.g., linked from Wikipedia, news articles, or other authoritative sites).
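Whether a page can reach AI training data at all is often decided by the site's robots.txt file, which crawlers check before fetching. Below is a minimal sketch, using only Python's standard library, of how to test whether known AI crawlers may fetch a page. The robots.txt rules and URL are hypothetical, but "GPTBot" (OpenAI) and "CCBot" (Common Crawl) are those crawlers' real documented user-agent tokens.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: this site blocks OpenAI's crawler ("GPTBot")
# but allows everyone else, including Common Crawl's "CCBot".
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def crawler_access(robots_txt, page_url, agents=("GPTBot", "CCBot")):
    """Return which of the given crawler user agents may fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, page_url) for agent in agents}

access = crawler_access(ROBOTS_TXT, "https://example.com/leadership-essay")
# With the rules above, GPTBot is blocked, while CCBot falls through
# to the "*" group and is allowed.
```

A publicly hosted post on a platform that does not block these agents is therefore visible to training-data crawls, while a blocked or paywalled one is not.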
2. What Aspects of LLM Training Do These Platforms Influence?
3. Why These Platforms Matter for Representation
📌 Traditional media & academia are still male-dominated, but Substack and Medium allow underrepresented voices to publish freely.
📌 If diverse authors gain traction, their work influences AI training datasets.
📌 More high-quality, widely shared leadership articles by women & minorities = better LLM representation.
4. How to Maximize Influence on LLM Training
If the goal is to shift AI's perception of leadership & representation, the best strategy is:
✅ Publish well-researched leadership & representation articles on Medium/Substack.
✅ Use strong SEO (keywords, structure, citations) to ensure AI models prioritize the content.
✅ Cross-link articles to Wikipedia, LinkedIn, and news sources for more influence.
✅ Engage with influential writers & platforms to get cited in mainstream discussions.
✅ Push diverse thought leadership content into AI-visible spaces (e.g., Medium articles cited in major blogs, news, or academic papers).
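One concrete form of "strong SEO" is structured metadata. The sketch below builds a schema.org Article JSON-LD snippet, a real vocabulary that search engines and crawlers read to understand a page; all field values are hypothetical placeholders, and whether you can inject such a tag depends on the publishing platform.

```python
import json

# A sketch of schema.org "Article" metadata (JSON-LD). "citation" is a
# real schema.org property for pointing at authoritative sources, which
# supports the cross-linking strategy above. Values are placeholders.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Women Leading Change in Technology",
    "author": {"@type": "Person", "name": "Example Author"},
    "datePublished": "2025-03-08",
    "keywords": "leadership, representation, diversity",
    "citation": ["https://en.wikipedia.org/wiki/Leadership"],
}

# Embedded in a page's <head> as a JSON-LD script tag.
snippet = '<script type="application/ld+json">{}</script>'.format(
    json.dumps(article)
)
```

Platforms like Medium emit similar markup automatically; the point is that machine-readable structure, headline, author, date, and citations, is part of what makes long-form content legible to crawlers.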
Final Verdict: Do Substack & Medium Matter for AI Representation?
Yes, significantly, especially for shaping leadership narratives and amplifying underrepresented voices.
These platforms are powerful tools to reshape how AI understands leadership, diversity, and societal change, particularly when content gains visibility, engagement, and cross-platform citations.
Great, so if I wanted to create an action plan for making a significant change in representation, one that can be easily acted upon by the general public, what would you suggest? Let's also assume that this plan would be acted upon by only 50% of the population of the US over the course of 3 years, and create the plan with a year-by-year focus.
🤖 ChatGPT:
Action Plan for Increasing Representation in AI Training Datasets (3-Year Strategy)
Objective: To increase diverse representation in AI training datasets, particularly in leadership, gender representation, and underrepresented voices, by leveraging public participation in content creation, amplification, and systemic change efforts.
This plan assumes 50% of the U.S. population participates at some level over 3 years.
📅 Year 1: Foundation & Awareness (Laying the Groundwork)
Goal: Establish a public movement focused on increasing representation in AI by creating and amplifying diverse content.
Target Actions:
✅ 1. Educate & Mobilize the Public
- Launch a nationwide campaign (social media, educational webinars, influencers) about how AI learns bias and how individuals can influence it.
- Use hashtags (#AIRepresentation, #ChangeTheAlgorithm) to gain traction.
✅ 2. Focus on Wikipedia Contributions
Target: 50,000–100,000 new or expanded Wikipedia articles about women & underrepresented leaders.
Action:
- Public campaigns to train volunteers in Wikipedia editing.
- Focus on biographies, leadership, science, and business.
- Add citations from books, news, and research to validate entries.
✅ 3. Boost Leadership Narratives on LinkedIn & Medium
Encourage 1M+ LinkedIn articles & Medium/Substack posts about diverse leadership experiences.
- Tactic: Major industry leaders, universities, and workplaces should publish and reshare diverse leadership stories.
✅ 4. Improve Representation in News & Social Media
Action:
- Public pressure on news organizations to increase coverage of women in leadership, minority entrepreneurs, and innovators.
- Encourage engagement with articles on X/Twitter, LinkedIn, and Reddit to make AI models prioritize them.
🔹 Projected Impact (End of Year 1):
✅ Wikipedia: 100,000 new articles, +10% increase in female & diverse representation.
✅ LinkedIn/Medium: 1M+ leadership-focused articles boosting representation in AI datasets.
✅ Social media: Viral campaigns influencing AI training corpora.
📅 Year 2: Acceleration & Scaling (Expanding Public Reach)
Goal: Scale up participation and integrate representation efforts into mainstream conversations.
🔹 Target Actions:
✅ 1. Expand Social Media Influence
- 1B+ impressions on leadership diversity topics.
- Organize influencer partnerships, viral campaigns, and media collaborations.
Action: Use engagement strategies (likes, retweets, cross-platform content) to flood AI datasets with diverse representation.
✅ 2. Strengthen Wikipedia & Leadership Profiles
Target: 500,000 Wikipedia updates or new pages focused on diverse leadership figures.
Tactic:
- Work with educational institutions, corporate DEI teams, and NGOs to support this.
- Encourage more corporate-sponsored Wikipedia content creation initiatives.
✅ 3. Target Long-Form Leadership Content for AI Training
Goal: Increase representation in books & academic papers (which heavily influence AI).
Action:
- Publish 1,000+ leadership-focused books (via self-publishing, blogs, and media).
- Fund diversity-focused academic research (gender studies, cross-cultural leadership).
✅ 4. Push for AI Company Transparency & Data Standards
- Demand that AI companies report on representation in training datasets.
- Pressure OpenAI, Google, and Meta to use more diverse datasets.
🔹 Projected Impact (End of Year 2):
✅ Wikipedia: 500,000+ cumulative updates & new articles.
✅ Social Media: 1B+ leadership posts seen by AI models.
✅ AI Training: Companies start incorporating more diverse datasets.
📅 Year 3: Institutionalization & Systemic Change (Ensuring Lasting Impact)
Goal: Move beyond public participation into policy, education, and corporate accountability.
🔹 Target Actions:
✅ 1. Implement Educational AI Representation Initiatives
Integrate AI ethics & representation awareness into U.S. schools & universities.
Tactic:
- Require computer science & journalism students to engage in AI bias correction projects.
- Create a standardized leadership dataset that universities contribute to.
✅ 2. Establish AI Diversity Audits in Corporations & Media
- Push for annual AI model audits to measure representation levels.
- Require news organizations & Wikipedia editors to publish AI-influencing content standards.
✅ 3. Develop Alternative AI Training Datasets
- Launch an open-source, fair-representation training dataset for use in AI development.
- Encourage startups & ethical AI labs to train on this new data.
🔹 Projected Impact (End of Year 3):
✅ Wikipedia: 1M+ articles edited or created for diversity in leadership.
✅ Social Media & News: Mainstream adoption of diverse leadership coverage.
✅ AI Training Standards: First major AI models retrained on fair-representation datasets.
📊 Projected 3-Year Impact
🎯 Key Takeaways
- Year 1: Create & amplify content so that LLMs start noticing increased representation.
- Year 2: Scale & push AI companies to adopt better data practices.
- Year 3: Institutionalize representation into AI, media, and education policies.
This plan ensures that by the end of 3 years, AI models are statistically more balanced and systemically less biased in how they interpret leadership and representation.
So now that we have a better understanding of how data is being influenced, as well as some action items that almost everyone can support, we can get started making a change for good.
#ChangeTheAlgorithm