Fast Content Marketing Analysis with NLP: GitHub vs GitLab

The complete code can be found on this Colab, and anyone can use it to analyze the content marketing of other companies as well. This article focuses on the results of the analysis.
We scraped the blog section of both the GitHub and GitLab websites. The total number of articles found for each company is:
- GitHub: 3822 articles, starting from the year 2008
- GitLab: 1605 articles, starting from the year 2012

GitHub steadily published about 15 articles per month until 2018, when it increased its output to about 35 per month, a pace it has kept since. GitLab increased the number of published articles every year from its inception, reaching about 25 articles per month in 2021, and showed a small decrease only in the last months of 2021.

For both companies, most articles were published from March to June and from August to October. Holiday months such as December and July see fewer published articles.

Most articles were published from Monday to Thursday, with a distinct drop on Friday. Very few articles were published on Saturday and Sunday, I guess because fewer people read such articles during weekends.
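The month and weekday breakdowns above boil down to counting publication dates by calendar field. Below is a minimal sketch of that counting step, assuming each article carries a publication date. The sample dates and the helper names `count_by_weekday` and `count_by_month` are made up for illustration. It uses only the standard library; the analysis itself relies on pandas data frames.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sample; the real analysis uses the scraped publication dates.
sample_dates = [
    date(2021, 3, 1),   # Monday
    date(2021, 3, 2),   # Tuesday
    date(2021, 3, 5),   # Friday
    date(2021, 3, 8),   # Monday
    date(2021, 7, 10),  # Saturday
]


def count_by_weekday(dates):
    """Return the number of articles per weekday, Monday first."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in dates)
    return {name: counts.get(name, 0) for name in WEEKDAYS}


def count_by_month(dates):
    """Return the number of articles per calendar month (1 to 12)."""
    counts = Counter(d.month for d in dates)
    return {month: counts.get(month, 0) for month in range(1, 13)}


print(count_by_weekday(sample_dates))
print(count_by_month(sample_dates))
```

Filling in the zero counts explicitly keeps quiet days and months (weekends, December) visible in the resulting chart instead of silently dropping them.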
GitHub vs GitLab: what topics do they publish about?


Both companies, of course, publish a lot of articles about Technology, Software, and Programming. There are some differences in the right tail of both distributions, but they are not easy to spot in the previous charts. The question we must answer is:
What are the topics that are covered by GitHub and not by GitLab, and vice-versa?
To do this, we sort the topics by the percent difference between the two companies' coverage of each topic. For example:
- GitHub published 2 articles with topic Ux Design out of 3822 total articles, that is, GitHub's coverage of Ux Design is 0.05% of its published articles.
- GitLab published 51 articles with topic Ux Design out of 1605 total articles, that is, GitLab's coverage of Ux Design is 3.18% of its published articles.
Therefore, the relative difference is (3.18 - 0.05) / 0.05 = 62.6: using these rounded percentages, GitLab is roughly 60 times more likely to publish an article about Ux Design than GitHub, exactly the kind of insight that we want! Let’s now sort the topics by this percentage difference and see how many articles GitHub and GitLab published about them.
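The ranking can be sketched in a few lines of Python. The article counts below are the ones quoted in this article. The function names and the handling of a topic with zero articles on one side (treated here as an infinite difference) are assumptions for illustration, not necessarily what the Colab does.

```python
import math

TOTALS = {"GitHub": 3822, "GitLab": 1605}

# Articles per topic, as reported in this article.
TOPIC_COUNTS = {
    "Ux Design": {"GitHub": 2, "GitLab": 51},
    "Open Source": {"GitHub": 454, "GitLab": 42},
    "Learning": {"GitHub": 112, "GitLab": 14},
    "Infrastructure": {"GitHub": 17, "GitLab": 159},
    "Remote Working": {"GitHub": 11, "GitLab": 52},
    "Games": {"GitHub": 68, "GitLab": 0},
}


def coverage(topic_articles, total_articles):
    """Percentage of a company's articles that carry the topic."""
    return 100 * topic_articles / total_articles


def relative_difference(cov_a, cov_b):
    """(higher - lower) / lower, the measure used in the text."""
    low, high = sorted((cov_a, cov_b))
    if high == 0:
        return 0.0       # neither company covers the topic
    if low == 0:
        return math.inf  # assumption: one-sided topics rank first
    return (high - low) / low


def rank_topics(topic_counts, totals):
    """Return (topic, leading company, relative difference), largest first."""
    rows = []
    for topic, counts in topic_counts.items():
        cov = {company: coverage(n, totals[company])
               for company, n in counts.items()}
        leader = max(cov, key=cov.get)
        rows.append((topic, leader, relative_difference(*cov.values())))
    return sorted(rows, key=lambda row: row[2], reverse=True)


for topic, leader, diff in rank_topics(TOPIC_COUNTS, TOTALS):
    print(f"{topic:15} {leader:7} {diff:8.1f}")
```

Note that this sketch works on unrounded percentages, so the Ux Design value comes out near 59.7 rather than the 62.6 obtained from the rounded figures 3.18 and 0.05.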


It is now much easier to notice the differences in their content marketing strategy! These are some topics that GitHub covers significantly more than GitLab:
- Gaming, Videogames, Entertainment.
- Open Source.
- Education and Learning.
These are instead some topics that GitLab covers significantly more than GitHub:
- Ux Design and Design.
- Infrastructure, Continuous Delivery, Continuous Integration, Cloud.
- Remote Working.
GitHub vs GitLab: the Games topic

GitHub published 68 articles about Games, while GitLab published none. What exactly is GitHub publishing about games, and why?

GitHub is involved with numerous game jams like Game Off, js13kGames, and Ludum Dare. Game jams are events where developers build a game around a set theme within a fixed time limit. They are occasions for fun and socializing among video game developers, with whom, I suspect, GitHub wants to associate its brand.
GitHub vs GitLab: the Open Source topic

GitHub published 454 articles about Open Source, while GitLab published 42. Let’s see what these articles are about.

Even though GitHub is closed source, it has always been the home of many open source projects, and it wants to remain so. It publishes lists of the best community projects each month and covers the new features that support them in its changelogs.
GitHub vs GitLab: the Learning topic

GitHub published 112 articles about Learning, while GitLab published 14.

GitHub has a product called GitHub Classroom to conveniently manage assignments, grading, and student help, all from within GitHub. Moreover, they teach software development at their GitHub Global Campus, which is part of GitHub Education.
GitHub vs GitLab: the Ux Design topic

GitLab published 51 articles about Ux Design, while GitHub published 2.

GitLab puts a lot of emphasis on the UX Design advances made on their platform and also explains how they structure their work in this regard, talking about design systems and how to carry out remote work on UX and Design. It would appear that GitLab focuses on supporting heterogeneous teams.
GitHub vs GitLab: the Infrastructure topic

GitLab published 159 articles about Infrastructure, while GitHub published 17.

GitLab publishes many articles explaining how to do CI/CD, DevOps, and GitOps on their platform, a very common use case among their users. This is a strength of GitLab.
GitHub vs GitLab: the Remote Working topic

GitLab wrote 52 articles about Remote Working, while GitHub wrote only 11. Moreover, GitLab has been writing this kind of article since 2014, whereas GitHub wrote some only during the pandemic. Why?

Because GitLab is famously one of the world’s largest all-remote companies and has shared its knowledge on how to do it effectively since its founding, writing about how to scale remote and hybrid work in heterogeneous teams.
What NLP can do for Content Marketing
If you ever find yourself writing content for a company, you’ll probably:
- Study the customers of the company to find out what their problems and interests are;
- Study your competitors’ content strategy to get further insights;
- Produce content that interests the customers, with the assumption that other people with similar interests may become your customers in the future.
As we have seen in this article, the second step, studying your competitors’ content strategy, can be improved a lot with today’s Natural Language Processing technology.
An NLP approach to studying a company’s content marketing
The analysis was done using the Python programming language and following these high-level steps:
- Scrape all the articles from the “blog” section of the company website. We use requests to make HTTP requests, fake_useragent to get plausible user agents for our requests, BeautifulSoup to parse HTML, newspaper3k to extract structured data from each article's HTML, and langdetect to keep only English articles.
- Predict a list of topics for each article, using its title and text. We use the EasyTagger API.
- Analyze the results. We use pandas for data frames and plotly for charts.
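As a rough illustration of the scraping step, here is a minimal sketch that uses the libraries named above. The function names, the same-host link filter, and the assumption that article links can be read directly from a blog index page are all mine and will differ from site to site. Topic prediction is left out because the EasyTagger API is not described in this article. Third-party imports sit inside the functions that need them, so the URL helper can be tried without installing anything.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(base_url, hrefs):
    """Resolve hrefs against base_url; keep unique links on the same host."""
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in hrefs:
        url = urljoin(base_url, href).split("#")[0]  # drop fragments
        if urlparse(url).netloc == host and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def fetch_html(url):
    """Download a page with a randomized, plausible User-Agent header."""
    import requests
    from fake_useragent import UserAgent

    headers = {"User-Agent": UserAgent().random}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text


def list_links(index_url):
    """Collect candidate article links from a blog index page."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(fetch_html(index_url), "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    return same_site_links(index_url, hrefs)


def parse_article(url):
    """Extract title, text and date; return None for non-English articles."""
    from langdetect import detect
    from newspaper import Article  # provided by the newspaper3k package

    article = Article(url)
    article.download(input_html=fetch_html(url))
    article.parse()
    if not article.text or detect(article.text) != "en":
        return None
    return {
        "url": url,
        "title": article.title,
        "text": article.text,
        "date": article.publish_date,
    }
```

A real run would call `list_links` on each page of the blog index and `parse_article` on every link found, ideally with a pause between requests to avoid overloading the site.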
The complete code can be found on this Colab.
Conclusion
From this analysis, it would seem that GitHub has over time become a place with a strong community for open source and public development, one that teaches programming and brings its users together. The strength of GitLab, instead, would seem to be its support for private and enterprise projects, with excellent tools to manage their infrastructure and with advice on how to organize work in heterogeneous teams.
Would you like to see more marketing content analytics? If so, suggest in the comments which companies you would like me to analyze!
