Protecting PDF files

Hi, in addition to members-only web-page content, we need to now also add PDF files as members-only content. In our case, these are PDF versions of the members-only articles.

As I understand it, MS uses JavaScript to hide protected content, and I’m not completely clear whether this means that all protected content is also hidden from search engines. Similarly, will Google be able to find the link to the PDF (and the PDF itself) if it is within the protected part of the content? Will it crawl and index any such PDF?

The site is hosted on Webflow, so we don’t have access to a .htaccess file.

A suggestion I’ve found online is to block access through robots.txt

User-agent: *
Disallow: /*.pdf$ # Block PDF files. Non-standard but works for major search engines.

“The problem is that this only prevents Google from accessing the PDF files but does nothing to remove them from its index or being listed in search results.”

This is what I’ve read online.
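
To see what a wildcard rule like this would and would not cover, here is a minimal Python sketch of Google-style pattern matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function names and sample paths are made up for illustration, and real crawlers may differ in edge cases. Note also that robots.txt rules apply only to the host that serves the robots.txt file, so a rule on the site's own domain would not cover files served from a different host.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes Google-style semantics: '*' is a wildcard and a trailing '$'
    anchors the end of the URL. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' in place of each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the given URL path."""
    return any(robots_pattern_to_regex(p).search(path) for p in disallow_patterns)


# Hypothetical paths, for illustration only.
print(is_blocked("/files/article.pdf", ["/*.pdf$"]))
print(is_blocked("/articles/some-page", ["/*.pdf$"]))
```

Even when a rule matches, it only asks crawlers not to fetch the URL, which is the limitation the quote above describes.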

So, what is the correct and 100% secure way to exclude protected PDFs? We just cannot have these PDFs appear in a search engine or be found any other way.

Thank you.

Great question! I think a single line of code and MemberStack’s default functionality will be enough to keep Google away. The following is from Google.

To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag into the <head> section of your page:

 <meta name="robots" content="noindex">

By default, MemberStack redirects search engine bots via JavaScript, so hidden content should not be accessible to search engines anyway. For good measure, we also inject the meta tag above into the page. That said, I still recommend adding the meta tag directly to your hidden content.

Thank you for your reply, Duncan. You make a couple of points on which I’d like to get back and clarify:

This seems to apply to pages that are not to be indexed. This is not what we require. In contrast, our pages with members-only content also have public content on them, so that should be crawled and indexed and links should be followed. What we need is for the linked PDFs to be hidden/excluded together with the members-only content. I don’t see how adding that meta tag would address this.

I cannot see that meta tag when I inspect the page with members-only content on it.
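
A quick way to confirm whether a robots meta tag is present in the raw HTML of a page (as opposed to the rendered DOM seen in the inspector) is to parse the source directly. The sketch below uses Python's standard `html.parser`; the class and function names are made up for illustration, and the sample markup stands in for a real page. A tag injected by script at runtime would not show up in this check, since it only looks at the HTML as served.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # content may hold several comma-separated directives.
            self.directives.extend(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html):
    """Return the list of robots directives found in the raw HTML."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Sample markup, for illustration only.
sample = '<html><head><meta name="robots" content="noindex"></head></html>'
print(robots_directives(sample))
```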

Hello!

Where are your pdf files hosted? Right Inside Webflow?

Yes, they are to be uploaded via the Webflow CMS. So a PDF’s URL would look something like
https://uploads-ssl.webflow.com/582f21bd539fedd346d8937a/5d8bc07a2a544b7d549bd50a_filename.pdf

…at least that’s the plan.

I’ve just checked again on this one: This doesn’t seem to be the case for any pages that haven’t been hidden via the Hidden Content/Pages section in the MemberStack dashboard. As said, our pages are public but have members-only content on them – and they do all appear in the Google Search results (just as we’d want them to).

In addition, Google has crawled and indexed all the members-only content on these pages: If I search for a specific phrase from the members-only part of the page, Google shows exactly that text in the search results.

Therefore, the question is: As Google has access to the on-page members-only content, how can we prevent Google and other search engines from accessing and listing any PDFs that are linked within the members-only section?

Just to repeat that these PDFs are full versions of the member-only articles, optimised for print and for members to download for reference.

Thank you.

Got it! This makes total sense now. I’ll ask Google…

In the meantime, I think you’re still in luck. MemberStack actually removes any content with an ms-hide attribute. Assuming you’re using ms-hide attributes, your PDFs will be gone from the page before a bot has time to crawl them. That said, it would be a good idea to find some code that tells these bots to stop indexing, or to skip, a particular element.

https://perishablepress.com/tell-google-to-not-index-certain-parts-of-your-page/

Can you share a text blurb from a PDF that should not be visible to Google? I’ll do some searching and let you know if I find anything.

Thank you for the reply.

Yes, we use ms-hide on different elements on the page. Inspecting the code once the page has loaded, it certainly seems as if the protected content isn’t part of the HTML any more. HOWEVER, when loading the source code in the browser “view-source:https://…” the protected parts show up.

Also, when I use a View as Google Tool, e.g. https://totheweb.com/learning_center/tools-search-engine-simulator, the protected text also shows.

In addition, protected text shows in Google search results when I search for a specific phrase.

Firstly, these are very simple ways for anyone to circumvent the protection and view the supposedly protected text. Secondly, this doesn’t really support your statement that “PDFs will be gone from the page before a bot has time to crawl them”. If Google currently sees the rest of the protected page, it will also see any link to a PDF that’s part of it, won’t it?
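
The view-source check can be repeated for PDF links specifically: if a link appears in the HTML as served, anything that reads the raw source can see it, whether or not a script removes it later in the browser. Below is a minimal Python sketch using the standard library; the names and the sample markup (including the example.com URL) are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PdfLinkFinder(HTMLParser):
    """Collects href values of <a> tags whose URL path ends in .pdf."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare only the path so query strings do not hide a PDF link.
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.pdf_links.append(href)


def find_pdf_links(html):
    """Return every PDF link present in the raw (unrendered) HTML."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return finder.pdf_links


# Sample markup, for illustration only.
sample = (
    "<p>Public intro</p>"
    '<div><a href="https://example.com/files/article.pdf">Download</a>'
    '<a href="/about">About</a></div>'
)
print(find_pdf_links(sample))
```

An empty result for the real page source would suggest the link is not exposed in the served HTML; a non-empty one would mean the link is discoverable regardless of what happens after the page loads.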

Do you mean a text bite from a supposedly protected PDF that’s online already? No PDFs have been put up yet because we’re waiting to have clarity on this before giving the green light to the site owner that they can upload.

[Edit]
Maybe adding rel="nofollow" to PDF links would be the thing to do if there is nothing else?

Good news! We found a simple solution.

I assume your current UX is to click on a PDF and have it open in a new tab. So! Instead of opening the PDF link directly, I recommend you create a CMS template with a full-page iframe. You can then pop the PDF into that iframe and hide the entire page. This also means you can place the following code in the header:

 <meta name="robots" content="noindex">

No more indexing!

If you need help reworking your site to accommodate the changes I’m happy to jump on a call. I’m sure you’re ready for this to be done!

By the way, I don’t think that View as Google tool is accurate. A number of articles, including Google’s own, say Google has been respecting JS and other SEO meta instructions since 2008. I see several articles that say otherwise, but they use JS in different ways than MemberStack does.

Thank you for your reply, Duncan. Not sure I 100% follow.

But I’m already using the CMS template for this collection for the non-PDF content. I don’t have another template to play with as Webflow only allows 1 template per collection.

Also, the PDF’s supposed to be for downloading and printing. So that wouldn’t work from an iframe, would it?

Great question! Could you DM me a read-only link for your site?

I see a few options, but the best one depends on how things are set up now. Would you be open to adding another CMS collection for PDFs, and using a reference or multi-reference field to link them?

Yes, would be open to that. I will DM you.

Hi @spirelli - did you manage to find a solution to this in the end? I have exactly the same issue.
