🚑 Sam Altman wants to make us healthier with AI

AI-HOI AInauts,

Welcome to the new issue of your favorite newsletter. Not much news today, but a lot of practice on the subject of scraping!

If you think this isn't relevant to you, or the topic doesn't mean anything to you: Please read on anyway and let us convince you otherwise. It's an incredibly useful skill!

Alright, let's jump in:

🔒 AI scraping: How to protect your website from the bots
🤖 How to scrape data to train your AI tools
🚑 Sam Altman wants to change our behavior and make us healthier

Let's go!

🔒 AI scraping: How to protect your website from bots

In order for our ~~beloved~~ AI models such as ChatGPT, Claude & Co. to become really smart, they need to be fed with data and knowledge. From the internet, of course. And how? By sending out an army of small robots or bots, surfing the web, sucking up and storing the content of all websites.

You’ve heard it before; the whole thing is called web scraping or crawling.

So far so good. Perhaps you can imagine that many website owners don't like this. And that's also a reason why there are various lawsuits against AI companies, often brought by media companies.

How to protect your website from AI bots and scrapers

If you don't want your website to be crawled by scrapers, you have to do something about it yourself. And there are many of these little bots, as you can see in the graph below:

The biggest data collectors are ByteSpider (from the Chinese company ByteDance, which also owns TikTok), Amazon-Bot, Claude-Bot and GPT-Bot.

The problem is often that only a few large providers - OpenAI, Google, etc. - label their bots properly. But there are also countless other bots that pretend to be just normal web browsers.
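For the bots that do label themselves, a robots.txt file is a common first line of defense. The following is only a minimal sketch in Python: it builds a robots.txt that asks four AI crawlers to stay away and then checks the result with the standard library's parser. The user-agent tokens (GPTBot, ClaudeBot, Bytespider, Amazonbot) are our assumption of how these bots currently identify themselves, so please verify them against each provider's documentation. Also keep in mind that robots.txt is purely voluntary: bots that pretend to be a normal browser will simply ignore it.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens of the labeled AI crawlers (verify before use).
AI_BOTS = ["GPTBot", "ClaudeBot", "Bytespider", "Amazonbot"]


def build_robots_txt(bots):
    """Return robots.txt text that disallows the given bots and allows everyone else."""
    lines = []
    for bot in bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Check whether a user agent may fetch a URL according to the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Print the file so it can be saved as /robots.txt on your own site.
    print(build_robots_txt(AI_BOTS))
```

Save the printed text as robots.txt in the root folder of your website. Well-behaved crawlers read it before they fetch anything else.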

Currently, the easiest way to protect your website from these bots is to use a service called Cloudflare.

Cloudflare is a website security provider that also speeds up loading times. It already offers very good protection against various kinds of attacks, such as DDoS attacks.

Cloudflare now also lets you block all AI bots with a single click. The feature is based on a proprietary machine learning model that also blocks the bots that pretend to be regular web browsers.

Bots blocked: Will all AI models now remain up to date?

As website operators, we naturally like the Cloudflare service.

As AI model users, not so much. Cloudflare is very widespread and super easy to implement, and 80% of the Cloudflare users surveyed already want to use it to block bots.

And as a result, the new models might lack new data.

It remains to be seen how big the impact will really be. There will probably be a cat-and-mouse game between scrapers and website operators.

In addition, many AI providers are already licensing content from media companies and online platforms. OpenAI, for example, has agreements with Axel Springer, TIME, Reddit, Vox and many others.

Or you can do it like Microsoft, which simply bought the GitHub platform for a few billion back in 2018...

🤖 Practice: How to scrape data to train your AI tools

Now that you've learned how to protect your website, let's talk briefly about how you can scrape content yourself.

But first: Why is this so important?

We talk a lot about context. In other words, the examples and information that you integrate into your prompts.

The better the AI models know what you want, the better the answers will be.

This is especially true if you want to train your own GPTs or chatbots for specific use cases. You need to provide data and examples!

For instance, if you want to have your emails automatically answered, it makes sense to provide your existing email responses.

Or if you want to build a chatbot that handles your customer support, you first need to feed it with your knowledge base, rules and additional data.

In short, data and examples are always at the beginning of any good AI workflow.

Scraping itself is a science and can be very complex. But to get you started, here are 3 simple ways to scrape data and use it for your AI:

1) Save entire websites with all their subpages

If you want to save the content from entire websites, we love Simplescraper!

Simply enter a URL and the scraper will pull all the pages and export them as a structured JSON file.

Note: The free version only works for up to 159 pages. For larger websites, you will need a paid version.

2) Scrape individual pages or the top 5 Google results

The reader from Jina AI is an extremely powerful and inexpensive scraper. Jina is especially brilliant if you only want to save a single page, or just the first five results of a Google search query, including their content.

For individual pages, simply put the following prefix in front of the URL:

https://r.jina.ai/

Here is an example with one of our articles:

https://r.jina.ai/https://www.ainauten.net/p/moshi-voice-chat-claude-ant-hack-ai-news

You will then receive the page formatted in Markdown - perfect for LLMs.
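If you would rather do this from a script than in the browser, here is a minimal Python sketch of the prefix trick described above. The function names reader_url and fetch_markdown are made up for this example, and we assume the reader can be called without an API key. Rate limits and access rules may differ or change, so treat this as a starting point.

```python
from urllib.request import urlopen

# The reader prefix from the article above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url):
    """Put the Jina reader prefix in front of a page URL."""
    return READER_PREFIX + page_url


def fetch_markdown(page_url, timeout=30.0):
    """Download the Markdown version of a page (needs network access)."""
    with urlopen(reader_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    article = "https://www.ainauten.net/p/moshi-voice-chat-claude-ant-hack-ai-news"
    # Only prints the address; call fetch_markdown(article) to download the text.
    print(reader_url(article))
```

You can then paste the returned text straight into a prompt or save it as a file for your GPT.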

Another cool feature is that you can use Jina to automatically pull the top 5 Google results for a question, including their content.

Simply enter the following URL followed by your question:

https://s.jina.ai/Who are the AInauts?

As a result, you get the content of the first 5 results of the Google search, which you can enter directly as context in the AI model of your choice.
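In a script, a question with spaces and a question mark should be URL-encoded before it is attached to the search address. The sketch below shows one way to do that in Python. The helper name search_url is our own invention, and whether the search endpoint accepts requests without an API key is an assumption you should check yourself.

```python
from urllib.parse import quote

# The search prefix from the article above.
SEARCH_PREFIX = "https://s.jina.ai/"


def search_url(question):
    """Build the search address, percent-encoding spaces and special characters."""
    return SEARCH_PREFIX + quote(question, safe="")


if __name__ == "__main__":
    print(search_url("Who are the AInauts?"))
    # prints https://s.jina.ai/Who%20are%20the%20AInauts%3F
```

Browsers usually do this encoding for you, which is why the plain version works when you type it into the address bar.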

3) More complex scraping challenges

As mentioned at the beginning, scraping can be very complex and challenging. But don't worry, there are also countless tools for more extensive projects.

Scraping is one of the oldest disciplines of the World Wide Web. Google does exactly the same thing when it builds its search index.

Apify is a tool with endless possibilities, but still can be used by mere mortals like us. And the cool thing about Apify is that there are already a lot of pre-built scrapers.

These pre-built scrapers are extremely useful because they already understand the structure of the associated web pages. And you can also integrate Apify into your automations via API in Zapier, Make etc.

Give it a try, it is a great skill to have!

🚑 Sam Altman wants to change our behavior and make us healthier

We actually had a third hands-on topic up our sleeve, but as this issue has already become quite extensive, we will end with a short news item.

Sam Altman, CEO of OpenAI, and media mogul Arianna Huffington published an article in TIME magazine a few days ago.

In it, they explain how the newly founded company Thrive AI Health wants to help prevent or treat chronic diseases in the future.

And in this case, AI is not simply supposed to discover new medicines, but to bring about a change in the users' behavior!

It is not yet clear when the associated app will be released. But they have been able to recruit the experienced Googler DeCarlos Love as CEO. He has already worked on Fitbit and other wearables.

This is very exciting! Until now, personal coaches and specialized doctors have been extremely expensive - this could change the game.

If anything, social media has made our behavior worse, and perhaps AI will be able to steer our habits back onto a better track with the right impulses.

Super cool story, and the TIME article is well worth a read.

That's it for today. We hope you liked it - see you next time!

Reto & Fabian from the AInauts

P.S.: Follow us on social media - that motivates us to keep going 😁!
Twitter, LinkedIn, Facebook, Insta, YouTube, TikTok

Your feedback is essential for us. We read EVERY comment, so just reply to this email. Tell us what was (not) good and what is interesting for YOU.

Your feedback is our rocket fuel - to the moon and beyond!