Anyone with web scraping experience want to help archive old RTS scripts?

Continuing the discussion from Best practices for scripters migrating their old scripts to EroScripts?:

I think a project where we try to archive the script posts in RTS would be extremely valuable. We’d need someone to help figure out the web scraping as well as how to store and serve this archive.
It also beats manually trying to migrate every single script over from the last 7+ years.

Right now I don’t have the time and I’m lacking in experience to handle this myself.
If anyone is interested in helping out, please leave a reply.

I’ll give it a shot. Do you want to list the old RTS posts in a certain category?

@hugecat might have something else in mind but I think script-creator-portfolios would be a good category for this. Scrape RTS for scripts and organize them here by creator. jacecolm for example said that he doesn’t have the time to do this with his scripts but said someone else could for him.

Thanks for volunteering to help. I’m open to ideas on how to best handle this.

Did a Google search on importing topics into Discourse using files. Seems like it’ll require additional work for data formatting and/or writing up an import script. See “How to import posts in CSV format? - #2 by pfaffman” (support, Discourse Meta) and “Topic and Category Export/Import” (developers, Discourse Meta).

If we do end up importing posts directly into this forum, I’d imagine creating a separate category like RTS Script Archive or something.

Alternatively we could scrape the posts+attachments and store them somewhere outside of Discourse.

Oooh @xrobin that’s a much better idea than what I had tbh.

I’m not sure if there are any issues we’ll run into there regarding max # of attachments or max character length in a post though.

Yeah, good point about length. Husky did a nice job and he has a pretty long list, but jacecolm’s is much longer I’m sure. Maybe if we run into a limit we could try a wiki post or just multiple posts in jacecolm’s thread for example?

Husky shared all his scripts via Mega to get around any attachment limit. We could also try attaching a zip file of multiple scripts if we don’t want to use Mega. (idk what the file size limit is on attachments so it might take more than one)

What about older posts from scripters who possibly only shared a couple of scripts? Those should be saved as well, right?

If it seems like they’re not going to be active on here and they’ve only put out a small handful of scripts, maybe there could be a Misc Archived Scripters thread organized like:

Scripter 1

  • Script
  • Script
  • Script

Scripter 2

  • Script

Scripter 3

  • Script
  • Script

etc…

From what I can see in my settings, there’s no attachment max. There’s a post character limit of 32000, but I have control over this. Again, not sure if we’ll run into some hidden upper limit in the software.
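
If a long portfolio does hit the character limit, one option is to pack the entries into several consecutive posts. Here is a minimal sketch of that, assuming each script entry is already rendered as a text block. The function name `split_into_posts` is made up for illustration, and Python's `len()` may not count characters exactly the way Discourse does, so leave some headroom.

```python
def split_into_posts(entries, limit=32000, separator="\n\n"):
    """Greedily pack text entries into post bodies of at most `limit` characters.

    `limit` defaults to the 32000 figure mentioned above; the greedy packing
    strategy itself is an assumption, not something the forum prescribes.
    """
    posts = []
    current = ""
    for entry in entries:
        if len(entry) > limit:
            # A single entry that is too long cannot be placed anywhere.
            raise ValueError("a single entry exceeds the post limit")
        candidate = entry if not current else current + separator + entry
        if len(candidate) <= limit:
            current = candidate
        else:
            # Close the current post and start a new one with this entry.
            posts.append(current)
            current = entry
    if current:
        posts.append(current)
    return posts
```

Each returned string would become one post in the creator's thread, in order.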

Yup, don’t see why not

@xrobin for every script post, there’s going to be a post title, post description + links, maybe some images, attachments. There are also post replies, and I don’t actually know if RTS supports attachments in replies. Replies sometimes include things like updated video links when the original fails.
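
To make that list of fields concrete, here is a rough sketch of what one scraped record could look like. All class and field names (`ArchivedPost`, `Reply`, `topic_id`, and so on) are hypothetical; nothing here reflects an agreed format, it just mirrors the pieces listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Reply:
    # Replies are kept because they sometimes carry updated video links.
    author: str
    body: str
    links: list[str] = field(default_factory=list)


@dataclass
class ArchivedPost:
    topic_id: int          # assumed to come from the t= value in the topic URL
    title: str
    author: str
    description: str
    links: list[str] = field(default_factory=list)        # links in the post body
    images: list[str] = field(default_factory=list)       # preview image URLs
    attachments: list[str] = field(default_factory=list)  # attached file names
    replies: list[Reply] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the post, including nested replies, to a JSON string."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Keeping replies as an optional list means they can be ignored at first and filled in later without changing the format.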

If we chose to include all of that above information (ignoring replies for now), it seems like a single mega post will be too large to navigate. I guess you could use anchors to create a table of contents and link to different anchors.

I wonder if the most straightforward thing to do is scrape all the data and dump it in a public S3 bucket. Then create a super barebones front-end to explore that old data.
You do lose out on integrating old posts directly into the new site though.

EDIT: Another thing I forgot is that sometimes RTS script posts have no attachments. They may just link to 1 or more megas that contain the video/script files.
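
As a rough sketch of the dump-then-browse idea: write one JSON file per post plus a single index file, so a barebones front-end only has to fetch the index. The resulting folder could then be synced to a bucket or any static host. The function name `write_archive`, the folder layout, and the dictionary keys are all assumptions for illustration.

```python
import json
from pathlib import Path


def write_archive(posts, out_dir):
    """Write posts/<topic_id>.json for each post and an index.json listing them.

    `posts` is a list of dicts with at least topic_id, title and author
    (hypothetical keys, not an agreed format).
    """
    out = Path(out_dir)
    (out / "posts").mkdir(parents=True, exist_ok=True)
    index = []
    for post in posts:
        rel_path = f"posts/{post['topic_id']}.json"
        (out / rel_path).write_text(
            json.dumps(post, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        index.append({
            "topic_id": post["topic_id"],
            "title": post["title"],
            "author": post["author"],
            "path": rel_path,
            # False for posts that only link out (for example to a mega).
            "has_attachments": bool(post.get("attachments")),
        })
    # Sort by author, then title, so the index can be browsed by creator.
    index.sort(key=lambda e: (e["author"].lower(), e["title"].lower()))
    (out / "index.json").write_text(
        json.dumps(index, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return index
```

The `has_attachments` flag makes it easy to list the posts whose files live only behind external links, since those are the ones most at risk of going dead.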

Yeah, the way Husky shared his archive is pretty compact since it’s just a list of titles with video and preview links, and then a single mega full of all the scripts. I’m not sure how we’d scrape or host the scripts but we’d need to figure out something. If possible, it’d be great to do something like what Husky did except with embedded previews, but I don’t know how do-able that is. If we wanted to include a description, indented under each title, that might be do-able too. I see what you’re saying about scraping replies for mega links to videos. Those are important but I’m not sure how we’d scrape them.

I think I found a method to scrape the entire free funscript section on RTS. It should give me a list of every post title, author, description, images and links.
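
The scraping method itself isn't described here, so the following is only a sketch of one small piece of such a pipeline: pulling links and images out of a post's HTML with the standard library, and flagging the mega links separately. The names `LinkCollector` and `extract_links` are hypothetical, and the real RTS markup may need more handling than this.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hostnames treated as mega links; extend as needed (assumption).
MEGA_HOSTS = {"mega.nz", "mega.co.nz"}


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def extract_links(post_html):
    """Return all links, all images, and the subset of links pointing at mega."""
    collector = LinkCollector()
    collector.feed(post_html)
    collector.close()
    mega = [u for u in collector.links if urlparse(u).hostname in MEGA_HOSTS]
    return {"links": collector.links, "images": collector.images, "mega": mega}
```

The same function could be run over reply bodies to pick up replacement video links.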

It’d be cool if we could archive portfolios like Realcumber’s here, with embedded previews https://realtouchscripts.com/viewtopic.php?f=63&t=7163. It makes the post/thread really long but it’s nice to just scroll until you see something you like.

We definitely need to handle the free funscripts and free VR funscripts sections. Is it necessary to scrape the paid sections though?

I’d tend to think it’s not as necessary to scrape the paid section, since those are going to be browsable on SLR, RS, and CzechVR. I guess it might be nice if someone here is searching for their favorite actress or studio and this could be a one-stop source for searching all scripts, but I suspect many pro scripters may get around to their own portfolios at some point.

Creating a separate category “RTS Archive” or something like that could be a start.

Maybe subcategorizing into VR/2D/JOI/Hentai/SFM like Husky did?

script-creator-portfolios is already a category for RTS Archives. Do you have a reason in mind for why we should make another one?

Ah alright my bad, I wasn’t thinking straight.

I’m scraping 1966 RTS posts containing title, description, images and files. Where would you guys like me to post them once I’ve scraped them all? I should have them all finished tomorrow.

So, what I’m thinking is, once you have all that data, divide it by author, then start with the most prolific authors and give them each a thread in script-creator-portfolios. The less prolific, who have only made a handful, might be able to share one thread titled something like “Misc RTS Archives” within the script-creator-portfolios category, and within that post we could still organize them by author, but all in the same post.
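
That split could be sketched roughly like this, assuming the scraped data is a list of dicts with `author` and `title` keys. The function names and the `min_scripts` cutoff of 10 are placeholders, since no threshold for "prolific" has been agreed on.

```python
from collections import defaultdict


def plan_threads(posts, min_scripts=10):
    """Group titles by author and split authors into own-thread vs. misc.

    `min_scripts` is an arbitrary placeholder cutoff (assumption).
    Both returned dicts are ordered most prolific author first.
    """
    by_author = defaultdict(list)
    for post in posts:
        by_author[post["author"]].append(post["title"])
    ranked = sorted(by_author.items(), key=lambda kv: (-len(kv[1]), kv[0].lower()))
    own_thread = {a: sorted(t) for a, t in ranked if len(t) >= min_scripts}
    misc = {a: sorted(t) for a, t in ranked if len(t) < min_scripts}
    return own_thread, misc


def render_misc(misc):
    """Render the shared thread body as author headings with bulleted titles."""
    lines = []
    for author, titles in misc.items():
        lines.append(author)
        lines.append("")
        lines.extend(f"  • {title}" for title in titles)
        lines.append("")
    return "\n".join(lines).rstrip()
```

`own_thread` gives the order in which to create individual portfolio threads, and `render_misc(misc)` produces the body of the shared “Misc RTS Archives” post.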

Another thing to keep in mind is that before we start posting the data, maybe we should make sure the author is okay with us doing it for them. We know jacecolm has given us permission and he’s one of the most prolific so we can start with his. If the scripter has been MIA for a long time and is not on this forum, then we can assume it’s fine to post their portfolio but we should probably ask if it’s someone like Realcumber or Evernessince for example.

Edit: Then again, if we want to post everybody’s without asking first, it’s easy enough for us to remove it later if they ask because they want to post it themselves.
