Complications of AI training. Any solution?

WordPress.com is training AI with its users' content, and that has already prompted visual artists to abandon WordPress free hosting and move to other platforms, including ClassicPress of course.

The problem, though, is that a good number of the other hosting platforms are doing exactly the same: they sell their users' data and content to the companies that develop artificial intelligence algorithms. And even when they don't sell it, the content is still exposed to the AI bots that scan websites for this purpose.

This, as you understand, violates our intellectual property rights over our websites' content, whether that is photos, original artworks, essays, texts or blog posts.
Is there a way, a script or something, that can block the AI bots from scanning our websites?

In your robots.txt file you can insert these rules:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

However, I don’t know what impact it could have on SEO.

1 Like

The code @iljester gave you protects you from ChatGPT-based models. OpenAI is not the only player scraping the internet, however; with a search on the web you can get the identifiers for the bots of the other companies (Microsoft Bing, Apple, Claude AI…) and block them the same way you block ChatGPT.

On the SEO level, however, there is no clear answer as to whether blocking them might hinder it. So if you do block them, I suggest you start with the code above, see how it goes for some weeks, and then add other blocks (a sketch follows below) if the block is not affecting your site's SEO performance.
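For reference, the same pattern extends to the other crawlers once you know their names. Here is a minimal sketch of an expanded robots.txt using bot identifiers the vendors have published; these tokens change over time, so treat them as assumptions and verify each one against the vendor's own documentation before relying on it:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: FacebookBot
Disallow: /

Keep in mind that a Disallow line is only a request; compliant crawlers honour it, but nothing technically enforces it.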

Looking online I also found this article, which talks in depth about how to secure your site from being scraped by AI.

Again, take it with caution, and if you do block the AI bots, do so in a way that lets you check whether they damage your SEO, like I said above.

1 Like

Yes, there are others. I haven't done an in-depth search (which I will have to do for myself too, to avoid scraping of my own content), but it goes without saying that ChatGPT is the most used. Still, it is good practice to cover all the AIs present on the web: a real infestation.

1 Like

Hi, and thanks for the suggestions.
The problem with the robots.txt file is that not all robots respect its rules, and also that if the companies developing AI want to bypass such rules, they can easily tie the search engine to the AI bots, so you can't have the one without the other.

I've done some searching myself too, and these guys at https://cara.app/explore
are using this add-on
https://glaze.cs.uchicago.edu/ (Glaze - What is Glaze)
for masking/altering images.

But the thing is that you have to apply this masking as an extra layer on the image before you upload it. And what about the images already on a website, and the rest of the content, text etc.?

I'm so disgusted by how large companies are using our work that I'm ready to send my website into the dark web, to make it available only upon request and only to someone who knows its exact URL! At the end of the day my website is just a portfolio one.
It is not as if I'm able to sell anything from my website nowadays, and the random traffic is more of a nuisance than a way to attract customers.
I have to block Pinterest, block social media crawlers, now block AI bots, bad bots, spammers… you name it.

The more I think of it, the more appealing the idea looks of sending it to the dark web, printing the URL on my cards and letting it be there only for those who know the URL. It is a portfolio website after all; I want to showcase my work, not to attract clients.

1 Like

What is going to happen if I remove robots.txt completely?
What if I make the website accessible only through links spread here and there on the web, and hide it from crawlers and bots?
Will I be able to make it disappear from search engines and their damn bots, or will the already scanned images stay there forever?

And here is something else. I don't use SEO add-ons, plugins and such things. I make heavy use of metadata, tags and similar things that help bots identify my content, which I guess won't be needed in the future if AI is able to figure out what is what and bring it up in the searches (along with all the junk available).

Generally speaking, it is a mess. The old non-AI algorithms, which were already messed up by SEO (which can be manipulated professionally), now clash with the AI bots.

Going to the dark web won't help you. In fact, it would probably make the situation even worse: crime hangs out there. If you publish something on the web, it will always be found by bots, which scan the web. Also, many people may share your blog on their sites. Therefore, hiding on the web while remaining on the web is pure illusion.
AI technology, which I consider harmful and invasive, is relatively new, and therefore the remedies to defend against scraping are new. But over time they will increase. Perhaps there are already firewall plugins that prevent AI bots from scraping. I have a profile on DeviantArt, and there the works are protected from AI. However, I don't know what technology they use.
My advice is not to worry too much about this problem. Furthermore, you sell physical works, from what can be seen from your blog. Try the robots.txt rules published in the link posted by Elisabetta Carrara and see what happens. It is difficult for now to know what the best defenses are against content scraping.

2 Likes

I second @iljester about not worrying too much. I can understand your concerns, but as he noted, in time more tools to prevent AI from scraping content will arise.
Also, as you note, AIs are slowly integrating with browsers, and this makes it very cumbersome to find quick ways to defend sites from them.
AI as of now lacks clear rules and laws to keep it from breaking copyright, but in time we will get there.
Putting a filter on images, that is, the old-but-gold watermarking revisited, might be a solution (a sketch follows below); you could ask the devs of that tool if they intend to provide a system for CP users to use their technology from within CP via a plugin.
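For anyone who wants to try the watermarking route before a dedicated plugin exists, the stamping can be done on a copy of each image before upload. A minimal sketch using ImageMagick (assuming the classic convert command is installed; note that a visible watermark is not the same thing as Glaze-style adversarial masking):

# Stamp a semi-transparent copyright notice in the bottom-right corner of a copy of the image
convert original.jpg -gravity southeast -pointsize 28 \
  -fill "rgba(255,255,255,0.6)" -annotate +20+20 "© Your Name" watermarked.jpg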

1 Like

@Marialena.S I also found this one that seems promising.

The dark web is not only the .onion websites that can be visited only with Tor Browser. It is any website that can't be indexed by the mainstream search engines.
Universities' private databases and online libraries that are available only to their students are in the dark web. There is no way to know what is in the library or the database of a university with a regular Google or other engine search. You have to have access to their website and then use their internal search engine.
My idea is to remove my website from the search engines and make it available only to those who have its URL, or those to whom I personally give the URL on my card or in some other way.
And that is because I don't care about the random traffic that supposedly increases the ranking of my website in web searches; this sort of traffic never brought me any customers, not to mention that it is not the purpose of this website. It is a portfolio website, the easier way to show samples of my work instead of having to show them in physical form.
This article is very informative, btw.

Be careful with DeviantArt. It might protect your artworks from being scraped by AI, but it doesn't protect them from being downloaded onto everyone's computer. You can download anyone's image by right-clicking on it, opening it in a new tab in full size, and then downloading it normally.
Generally speaking, DeviantArt is not a particularly good website. It has way too many users, where 90% of them are copying each other.

It's actually been a long time since I posted anything on DeviantArt, also because I create physical rather than graphic works. On DeviantArt I only have minor vector works, that is, pieces created for study and practice, plus some unimportant loose drawings. However, DeviantArt provides watermarking, and even though you can download the image by opening another tab, the image comes with a watermark. Here is an example:

It is not only AI that I'm concerned about. It is the whole way that search engines present my website's content.
The whole indexing system, with the SEO and algorithm stuff, creates a fake ranking competition between websites that attracts the worst of the bad players and by no means those who are good for, and needed by, the people who run the websites.
No one is going to find my website if they don't know my name in order to search for it and find it instantly. But those who know my name already know my website's URL too, simply because it is my name! So why do I trouble myself to fine-tune the robots.txt and the SEO of my website when it won't appear in the first 100 results of any search engine if someone is looking randomly for some Greek watercolour artist? In fact, I've never seen anyone searching that way. They go straight for the name, or if they don't remember my family name, for something like "Marialena Greek artist" or similar.
The random others are 99% of the time irrelevant: bots, spammers and suchlike, wasting my time or consuming my bandwidth. That bandwidth is unlimited, but I can't see why it should be wasted this way.
The ai.txt is quite promising btw. I’ll check it out to see what I can do with it. Thanks. :slight_smile:

Those who download your images don't take them in order to repost them, but in order to copy them themselves: copy the idea, the composition, the layout, the technique, etc.
I'm of the opinion that it is better for visual artists of any kind to limit the places where they show their art.
I post my works only on my website, so there is no way not to know where someone else found my artwork images. Now, if those who repost my images reference my name or my website, I let them be. If I catch anyone using them without any reference, I demand that they remove them immediately.
I have also blocked the right-click copy option on my website, I don't allow pinning on Pinterest, I don't allow hotlinking (a typical setup is sketched below), and I upload only very low-resolution images, less than 50 KB per photo. That has worked well for me up until now.
For the past 11 years, to be more accurate.
At the beginning I used to run after traffic and ranking, but then I realized that this wasn't the purpose of my website, and that satisfying the requirements of ever-changing algorithms wasn't the purpose of my art at the end of the day.
And that is the reason why I've never used any social media to boost my art. My traffic is mine; it doesn't come from anyone else's website or platform.
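For readers who want to replicate that kind of setup, hotlink protection is usually handled at the server, and pinning is disabled with a meta tag. A minimal sketch for Apache with mod_rewrite, where example.com is a placeholder for your own domain:

# .htaccess: serve images only when the request has no referer (a direct visit) or comes from your own pages
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(jpe?g|png|gif|webp)$ - [F,NC]

<!-- In the page <head>: ask Pinterest not to allow pinning from this site -->
<meta name="pinterest" content="nopin">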

OK. You did very well. The precautions you use are good. But if someone intends to copy your work, perhaps reworking it, even a low-resolution image with a watermark is enough. So it's like trying to scoop up the sea with a teaspoon. Your works have value because they are physical works; only images circulate on the web, not the works themselves. The truth is that if we really want to protect our works from plagiarism, we should not publish them at all. But if you don't publish them, on your website or wherever you think is most appropriate, no one can appreciate them and perhaps buy them.

2 Likes

It is not only plagiarism (which is what AI does, btw); it is the reproductions too.
Once, years ago, I caught a Chinese company that had printed one of my paintings on napkins! And the way they did it was by taking one of my photos that was pinned on Pinterest! That was before I blocked Pinterest. They wouldn't have been able to find my website any other way, because my website didn't appear in Chinese search engines' results, Baidu or whatever their search engine is called.
Artworks have value in their physical form, but their reproductions have value too, given that you are supposed to be paid to license them for reproduction on things like napkins and tablecloths! Anyway, at some point I blocked Pinterest and registered my website on Baidu, in order to get rid of the former and to have a record on the latter that the website is mine!
The internet is vast, but it must have some rules too in order to be really useful for professional use.
At this point, though, it is simply a mess. It is not what it used to be; it has become very corporative (is there such a word?), I mean it is infested with large corporations that are trying to put everything under their business model and make the whole internet work on their own terms, and if you don't like that model you have to run constantly to fix things, like preventing AI from scraping your content so it can learn how to plagiarize it! Which is tragicomic as an idea, because that is not what AI was supposed to be made for.

As the author of ImagePress, a multi-user image gallery plugin, I can advise you to hide your images behind a login.

If you have your own website where you post images and artwork, why not create a free membership module and let users create accounts in order to view your art?

This way, you can also create an email list, which you can use later to promote more of your art.

Unfortunately, whenever you put something on the internet, you give up all the rights. Licensing, watermarking, asking nicely: none of it works. This has been the case since the beginning of time. There will always be someone to steal your data.

robots.txt won’t help you if thieves set their scrapers to ignore it.
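A step stronger than robots.txt is refusing those user agents at the server itself, which keeps working even when a bot ignores robots.txt, as long as it still announces itself honestly (a scraper that spoofs a browser user agent will get through). A minimal .htaccess sketch for Apache with mod_rewrite; the bot names listed are commonly published ones and should be re-checked periodically:

# Return 403 Forbidden to any request whose user agent matches a known AI crawler
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|CCBot|ClaudeBot|anthropic-ai|Bytespider|PerplexityBot) [NC]
RewriteRule .* - [F,L]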

Removing robots.txt and posting links here and there won't help either, because Google will find and index your content.

Setting watermarks or image overlays won't help you either, because I've seen cases where an AI was asked to remove the copyright mark, and it did a great job.

I stopped putting my fiction online, as I tried to self-publish it at some point, and Amazon Books said that my content already exists out there. And it was written by me, even before OpenAI existed.

3 Likes

Your comment is very interesting, and thank you for the suggestions.
My idea is to disallow the search engines and to post links to the website only in places that gather art people, which makes sense since this website targets a very specific group. Or, alternatively, to give the link physically, printed on my cards, only to individuals who are interested in seeing my work.
How will Google be able to index a website that disallows all search engines in robots.txt, even if there is a link to it somewhere online?

Creating a membership module defeats the purpose of a portfolio website. I mean, inviting a potential customer to see your work and then asking them to sign up… hm… it won't look good. In the best-case scenario it will look way too elitist and pretentious, and in the worst-case scenario it will look like a scam or something.
My (potential) customers will wonder what the heck I'm hiding behind passwords; they might expect something extraordinary and then get disappointed. And it is quite difficult to explain to people who are not familiar with how IT works why there is a need to hide your website's content. Not to mention that this might attract the "curious" ones, and then I'd have to approve countless member accounts from people whose origin I don't know.

TBH I don't know how to set this up, or whether it can work the way I want, but the idea of targeting the group that I want to visit my portfolio website makes sense. Those are the people I want visiting my website, not the whole universe. The random high traffic that supposedly advances my website in the search engine rankings doesn't serve the purpose of this website.
I don't need the "cheap T-shirts made in China" and the "buy Viagra online" crowd hitting my website or commenting on my blog posts in order to advance its ranking, because I know that will never happen the way this indexing and ranking system works.
Not to mention that I had to deactivate comments on older blog posts because I got tired of moderating comments from such sources.

But I have the notion that eventually we will all be forced to withdraw, so to speak, our websites from public view. At least those of us who know what we made our websites for.
It is one thing to have an online presence, which is convenient to start with and looks professional, and a completely different thing to have your work exposed to anyone who wants to steal it, scrape it, copy it, plagiarize it or even copyright it under their business model.

The Under Constructor plugin lets you redirect visitors to a static page, but you can provide "magic links" that let some users see the site.
The idea was to allow clients to see the progress.
The magic link is pretty long, but you can provide a QR code.

1 Like

Because Google does not treat robots.txt as a removal tool. Sorry to say it, but the idea that disallowing search engines (not only Google) is going to keep them at bay is a myth per se. Disallowing them only means that it takes longer for your pages to show up, not that they won't: a disallowed URL can still be indexed if other sites link to it. Usually, indexing a normal page that is allowed to be crawled takes from 4 up to 6-8 weeks; that is the standard indexing time for Google. When you set a disallow it might take more than that, but eventually you end up indexed.
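If the goal is to stay out of the results entirely, the more reliable signal is noindex rather than a robots.txt disallow, and the crawler has to be allowed to fetch the page in order to see it. A minimal sketch, assuming an Apache server with mod_headers for the site-wide variant:

<!-- In each page's <head>: ask compliant search engines not to list the page -->
<meta name="robots" content="noindex, nofollow">

# Or site-wide as an HTTP header, via .htaccess:
Header set X-Robots-Tag "noindex, nofollow"

Combined with password protection or the magic-link approach mentioned above, this is about as close as a public web server gets to being invisible to search engines.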

On the internet, given time and resources, everything can be accessed and copied.

Now, about building a restricted content area: that might be an idea, because to really see the works one has to be registered. But again, a password-protected area can be hacked; sites that allow registrations to restrict content, or that use a password-protected page, are subject to attack. Also, as you note, putting up such a fence is going to discourage potential customers. My feeling is that such a portfolio of artworks should not be on the web at all if the concern is that it is going to be copied or stolen. You could instead tell potential customers that IF they want to see it, you will be glad to send them a PDF presentation with the most stunning of your artworks.

2 Likes