Collecting data for machine learning / creation of scripts

Some people have tried working on automated scripting solutions. One recurring approach has been “why don’t we throw a lot of data at a machine and eventually it will learn?”. I don’t think the concepts of Artificial Intelligence (AI) and Machine Learning (ML) are new to most people these days.

How to help

Post your scripts and their source URL on faptap.com. That is the quickest and easiest way to support this project with barely any effort.

Approaching this requires data. A lot of data. I have already built a pipeline that takes in any video and its corresponding funscript file and extracts frames with their positional value. The positional value is the position the stroker should take: 100 for all the way at the top and 0 for all the way at the bottom.
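For anyone curious, here is a minimal sketch of what such a frame/position pairing could look like, assuming OpenCV for decoding and the standard funscript JSON layout. The function names are mine, not the actual pipeline’s, and the nearest-action lookup is a simplification (values between keyframes are actually interpolated, as discussed further down the thread):

```python
import json

import cv2  # assumes OpenCV; the real pipeline may use something else


def load_actions(path):
    """Funscript is plain JSON: {"actions": [{"at": ms, "pos": 0-100}, ...]}."""
    with open(path) as f:
        return sorted((a["at"], a["pos"]) for a in json.load(f)["actions"])


def frames_with_positions(video_path, funscript_path):
    """Yield (frame, timestamp_ms, pos), labeling each frame with the
    position of the nearest scripted action."""
    actions = load_actions(funscript_path)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t_ms = idx * 1000.0 / fps
        _, pos = min(actions, key=lambda a: abs(a[0] - t_ms))
        yield frame, t_ms, pos
        idx += 1
    cap.release()
```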

Sadly, this is not something a single person can do. A lot of data is needed and a single video can only provide so much. Many frames will be similar, and some variety is required; that is why I’m asking for your help.

I’m not expecting anyone to help me label and review the data, which sadly has to be done by hand. Crowdsourcing is not an option, due to the variety in the data and the instructions people would need. Quality is key, but I need quantity and variety as well.

I’m asking for your help as enthusiastic eroscript members and script creators. There are a few things we can do as a community to push this project forward, for me and for anyone else trying. This can become a valuable dataset and resource over time.

What you can do:

  • Take any video that interests you
  • Script a part of a video with very high detail
  • Post links to the most realistic and best scripts that you can think of.

If you are a creator and would like to contribute

Feel free to direct message me if you usually only offer paid scripts but would like to contribute. I’m sure we can find a satisfying solution, and I can explain the process in more detail then.

Scripting a part of a video:

  • Detail is king. Every frame should be within a small margin of error.
  • It does not have to work on your device. Too much detail over a short span of frames can kill the responsiveness of devices, but that is not a concern here. More detail is always better.
  • Please attach the funscript file and/or the video. A link to the video, if it’s free to watch, is fine as well.

Posting links to realistically scripted videos

  • More detail is better (that means high accuracy and also a lot of actions/keyframes)
  • Length doesn’t matter
  • Variety in positions/scenes within a single video is good
  • Please post a link to the funscript and video
  • It’s OK if the video has bad parts; I built a tool to filter them out

Thank you very much for any help, even if it’s only a little. Any participation helps tremendously. I’m hitting a limit as a single dev and data collector.

I will try to update this post irregularly with what has been included already.

Why this is worth your time

Even if this project does not succeed, providing a place for others can prove beneficial in the long run. Companies like SexLikeReal need competition, otherwise subscription prices will stay high. More scripted videos are always good.

Eventually the model will hopefully be capable enough to script videos in almost real time without much human input. Even when some parts are not optimal, it will greatly improve the speed of creating scripts. Scripting currently takes a lot of manual effort and is very, very time-consuming.

10 Likes

Confused, there are 1,489 free scripts on the free scripts list. As someone attempting the same thing, why is that not enough for you to start?

It is, most definitely. But not every script has the same quality and I do not have the time to filter through every single one. I’m already taking a look every now and then, but it’s very time-consuming. Some community help would certainly speed this up. :slight_smile:

I can imagine there are some power-users here, so to speak. These people probably have a much greater insight already than I do.

Also, I want to highlight again that this is beneficial to more people than just me.

It’s not enough for OP to start because a lot of the videos are low quality, not optimal, etc.
He wants to train it on a subset of scripts that are very high quality, from specific scenes.
I’ll think about this, but it takes time to clip and share scenes and share the funscript…

@rethought Do you have a github repo?
Also, will you be training on any hentai videos?
I’m thinking that POV BJ/Riding videos would probably be the best for training, though they might be too easy for when different videos come in.

I’ll try to keep track of what vids end up being my fav/most realistic and share them someday :tm:

1 Like

Initially I want to train on real humans. For hentai it could be beneficial to train a second model. I fear there is not a lot of data available for hentai, though, and the limited amount will negatively impact the model.

There is the possibility that with enough training one can apply transfer learning and “retrain” the existing human models on hentai, as the two share a lot of similarities. Initially that wouldn’t work out and would only confuse the model, though. Unless we have tons of data; then it could generalize and work.

This is a lot of trial and error, seeing what works and what doesn’t with the data available.

Thank you very much, this was exactly what I was looking for. It’s not a fast process, but we will get there, hopefully.

I do; it’s currently private. I’m very much interested in open-sourcing part of the software, I just haven’t decided where to go with it yet.

If it doesn’t work out I will definitely open-source all the important parts. If it works out, I could imagine building a small business around it or something similar. It’s just too time-consuming otherwise and would prevent me from being able to pay my bills in the long run, and I fear that open-sourcing everything, or the model itself, would keep me from actively working on it in the future.

I do have some crazy ideas to get around the current limitations. There are advances in spatial AIs that take 2D images and convert them to 3D representations. I haven’t researched it much yet, but it crossed my mind to use a similar approach and track points in 3D, thereby extracting the information required to determine the position a stroker should take. That would generate more data for a model that only needs to see a 2D image. I know it’s possible; the questions are how much data is needed and whether it will be of high enough quality.

1 Like

Sounds interesting. I definitely think that automatic script creation will be the next big thing. The whole video gets auto-scripted and the creator just manually checks everything, adjusts some strokes and maybe manually scripts the parts the program can’t automate.

Happy to help then.


Aisha Bunny - I always Wanted to Taste a Big White Cock - Naughty Step Sister gives me the best Blowjob

Very detailed blowjob / handjob. Lots of small movements that I added in. Completely in POV.


Eva Elfie - Tries a Big Cock inside her Tight Pussy

Also a very detailed script. Especially the beginning has a ton of small details; later on it gets simpler, with the riding sections. Also a full POV video. Make sure to take the updated version of the script.


If you need more videos with a lot of detail, let me know and I’ll see if I can find more. Same if you want some simple videos as well.

1 Like

I would guess, and it’s a guess, that all the dark data from those imperfect scripts delivers value. DeepMind is very good at harvesting imperfect data like that in their training pipelines. For example, the script for Little Asian Brat occasionally goes out of sync, but broadly it seems accurate. I do agree that the input data should be snippets both before and after the frame to be predicted, with funscript state supplied from all the preceding frames. Guessing maybe ~1s snippets myself, but I haven’t built my pipeline yet.

It’s a bit hard to say if that also applies to this use case. Depending on the project, the model could learn to self-correct, and some pipelines also include faulty data on purpose, to show the model how “not” to do it. That is most commonly seen in validation and test sets.

It is important to carefully consider the impact of imperfect data on the results of any machine learning models that are built using the data. In some cases, the imperfections in the data may not significantly impact the model’s performance, while in other cases they may have a significant impact and may need to be addressed in order to obtain accurate results.

I have a feeling in this case it will negatively impact the outcome. However, I had some thoughts regarding preventing bad data in the first place.

Mainly the concept of data imputation: This involves replacing missing values with estimated values based on the other data in the dataset.

I’m already interpolating between two or more frames if there is no data available. Say frame A has an action that is roughly 29ms “out of date”; frame B is 100ms in the future, but the closest scripted action is 200ms further in the future. In this case it will interpolate and estimate a position. However, when reviewing the data I noticed that sometimes it would be off and not look right. It’s entirely possible that it will feel right when using a stroker, though.
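For reference, a minimal sketch of that interpolation, assuming the actions are a sorted list of (time_ms, pos) pairs; the clamping behavior at the edges is my own assumption:

```python
import bisect


def interpolate_position(actions, t_ms):
    """Estimate the 0-100 position at t_ms between the two nearest keyframes.
    Outside the scripted range we simply clamp to the first/last keyframe."""
    times = [t for t, _ in actions]
    i = bisect.bisect_right(times, t_ms)
    if i == 0:
        return actions[0][1]
    if i == len(actions):
        return actions[-1][1]
    (t0, p0), (t1, p1) = actions[i - 1], actions[i]
    return p0 + (p1 - p0) * (t_ms - t0) / (t1 - t0)
```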

This is pretty much what your stroker does as well, only it doesn’t attach the data to a frame but rather uses the interpolated value to control the motor.

Something that could help with “it doesn’t look right”, but feels right in practice is to train the model on a sequence of frames instead of prompting it to find the correct positional value using a single frame.

Actually I got inspired by the concept of springs used in cloth simulations. Without going into much detail: basically, show the model one frame 125ms in the past, one 50ms in the past, the current frame, a frame 50ms in the future and a frame 125ms in the future, for example. (In practice it’s a frame index rather than milliseconds.)

This would potentially direct the model toward being able to deal with bad data and also have it develop an understanding of sequential data. That could significantly improve the estimations, as it’s now able to see “the bigger picture”.
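A minimal sketch of building such windows; the index offsets are my own choice and approximate -125ms/-50ms/now/+50ms/+125ms at 25 fps:

```python
def frame_window(frames, idx, offsets=(-3, -1, 0, 1, 3)):
    """Return the frames around idx at the given index offsets,
    clamping at the clip boundaries."""
    last = len(frames) - 1
    return [frames[min(max(idx + o, 0), last)] for o in offsets]


def training_windows(frames, positions):
    """Pair each frame window with the positional label of its center frame."""
    for idx, pos in enumerate(positions):
        yield frame_window(frames, idx), pos
```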

Thank you very much, this is exactly what I was looking for! It helps a ton to know where the good parts are and will save some time extracting information.

Looking at human-generated scripts, they seem far simpler than machine-generated ones, using only 10 or 20 of the possible values, in increments of 5 or 10. Your proposed spring approach seems interesting. I would also mess with the increments a bit during data generation to prevent overfitting; something like the sketch below. This would also serve to somewhat augment the gold data.
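A tiny illustration of that jitter idea; the ±3 range is just my guess at a sensible scale:

```python
import random


def jitter_position(pos, max_delta=3):
    """Randomly perturb a 0-100 position label to break up the
    5/10-step grid human scripters tend to snap to."""
    return min(100, max(0, pos + random.randint(-max_delta, max_delta)))
```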

But yeah: after training a model, go back, find out which parts of the training data are predicted badly and understand why, to improve the next round. I’m with you 100% on that.

This is going to be fun. Can we open up the data so we can all take our best shots at it?

1 Like

Sure thing!


Aisha Bunny script:

0:54 - 1:00
1:05 - 4:27
4:35 - 8:43
9:24 - 10:27
11:28 - 11:58


Eva Elfie script:

First Blowjob / Handjob section:

0:30 - 5:10
5:23 - 7:36
8:21 - 8:39
8:54 - 9:14
11:49 - 12:00
12:10 - 12:50

Riding section:
12:53 - 13:32
13:55 - 14:19
15:49 - 16:02

Last Blowjob / Handjob section:
18:48 - 19:02
20:21 - 21:08


Hope it helps!

3 Likes

Absolutely, I’d support that. The question is… where do we put the resulting data set. Do you know some cloud storage that is preferably free for our use case? I currently store some labeled data in the LabelBox cloud. But I can’t just expose that without creating a costly team.

Maybe just stick the metadata describing it up on mega?

Small update:

  • Reverse-engineered the FapTap API (it broke with the new update, sniff) and scraped the website’s tags, source URLs, funscript files and other minor metadata
  • Downloaded all the source data from the source URLs
  • Currently writing an algorithm that filters through it and looks for frame/funscript pairs that it thinks are accurate
  • Then I will manually go through it and post the data set on mega

If you got any questions or suggestions, I’m always open. :smiley:

2 Likes

Nice work RE-ing the FapTap API. That’ll be a great source of data. You could potentially do something similar with this forum using scrapy to find low-hanging fruit. Check for at least one live link to a video on some pre-whitelisted sites (mega, anything usable with yt-dlp), download all funscripts in the thread and prefix them with the poster’s name for filtering later.
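A minimal sketch of that scrapy idea; the start URL and CSS selectors are hypothetical placeholders, since the real forum would need its actual markup (or its JSON API):

```python
import scrapy

# pre-whitelisted video hosts; extend with anything yt-dlp can handle
WHITELIST = ("mega.nz", "youtube.com")


class ScriptThreadSpider(scrapy.Spider):
    name = "script_threads"
    start_urls = ["https://forum.example.com/c/find-scripts"]  # placeholder

    def parse(self, response):
        # follow links to individual threads (selector is a placeholder)
        for href in response.css("a.thread-link::attr(href)").getall():
            yield response.follow(href, self.parse_thread)

    def parse_thread(self, response):
        links = response.css("a::attr(href)").getall()
        # keep only threads with at least one link to a whitelisted host
        if not any(host in link for link in links for host in WHITELIST):
            return
        poster = response.css(".first-post .username::text").get(default="unknown")
        for link in links:
            if link.endswith(".funscript"):
                # prefix with the poster's name for filtering later
                yield {"poster": poster, "funscript_url": link}
```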

Ping if you need storage space :slight_smile:

Also, I’ve got a decently sized index of scripts+videos. Copyright concerns notwithstanding (more the scripts TBH than the videos), I’d be happy to share via DM. It’s decently categorized which ought to help filter out low quality scripts. Some scripts don’t follow the action (PMVs, Cock Hero), and some scripters have pretty odd/bad scripting styles which will ruin your dataset. You could also potentially train different models for different video styles too.

I’ve got gigabit up/downlink. Wouldn’t take more than a day or two to send my entire collection to you. You can just jdupes away anything that’s a duplicate from FapTap, but most of these videos/scripts are not on FapTap, so I’m willing to bet the vast bulk of this will be new data for you.

These categories are further divided by the scripter. I don’t have time to get the lists though, so that’s not shown here. I mention it because if you want to discard low-quality scripters, it’s a pretty easy find -exec to drop their folders.

  • 2D
    • Amateur Asian: 16GB
    • Amateur Ebony: 0.5GB (sadly lacking on this site fwiw)
    • Amateur Whities: 54GB
    • Cock Hero: 120GB
    • Hentai (entire anime episodes and lots of shorter western animation): 120GB
    • Hentai Cock Hero: 285GB
    • HMV: 57GB
    • K-pop: 5GB
    • PMV: 110GB
    • Professional Asian: 62GB
    • Professional Whities: 38GB
    • TOTAL: 947GB – 1189 scenes
  • VR. This is less organized. Probably 80% JAV, 20% European/American. It’d be fun to normalize this data for the AI; reprojecting the view to rectilinear would suffice. Or just train separate 2D and VR networks.
    • TOTAL: 1.1TB – 165 scenes

I found a way to bypass the mega.nz limit - this was all accumulated over the past 2 months or so in case anyone is wondering why my collection is so big. Or at least seems so to me - I bet there’s a dozen people with 5TB collections or some shit. I only download what I like. Not hoard.

Heh, I suppose you meant years?

Back to topic, that sounds phenomenal. You are absolutely right, a large majority of the data is sadly garbage. That’s why I’m currently contemplating which architecture and type of training is suitable for such a project.

I already came up with some methods of optimizing for the lack of data:

First comes filtering and adjusting data:

  • Divide the video into segments of 5 to 7 frames (the exact number needs experimentation)
  • Determine the accuracy of every segment (0 to 100%)
    • If the accuracy is above a certain percentage (e.g. 65%)
      • Save the corresponding interpolated positional values and frames to disk
    • Otherwise continue with the other segments and ignore this one

How exactly accuracy or “correctness” is defined is up to the implementation. Personally I determine the validity of every frame by simply looking at how close it is to a “keyframe” or “position” in the funscript. If there is a keyframe within 100ms or less it is good, otherwise less good, to simplify it (a lot); a rough sketch of this heuristic follows after the next bullet. This also needs some experimentation until there is a good balance between good and bad segments.

  • Next this data needs to be filtered manually (that’s where we are currently). I’m also taking the time to correct some of this data. I am considering skipping this step and restricting the generated data so much that only a few hundred segments will be available. That will be around 5000 frames, many of which will be near-duplicates. That is not too bad though, depending on the architecture of the model.
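The keyframe-proximity heuristic from above could look roughly like this; the smooth falloff and the 65% threshold are stand-ins for whatever the real implementation uses:

```python
def frame_score(t_ms, actions, good_ms=100.0):
    """Score one frame by its distance to the nearest funscript keyframe:
    1.0 at zero distance, about 0.5 at good_ms, smoothly falling toward 0."""
    nearest = min(abs(t - t_ms) for t, _ in actions)
    return good_ms / (good_ms + nearest)


def segment_accuracy(frame_times, actions):
    """Average frame score over one 5-7 frame segment, as a percentage."""
    scores = [frame_score(t, actions) for t in frame_times]
    return 100.0 * sum(scores) / len(scores)


def keep_segment(frame_times, actions, threshold=65.0):
    """Keep the segment only if its accuracy clears the (tunable) threshold."""
    return segment_accuracy(frame_times, actions) >= threshold
```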

Data augmentation:

  • Data is further augmented by cropping to the relevant areas
  • To automate this, a pre-trained model capable of few-shot tuning can be used to segment the image and thereby find the relevant area
  • Still need more ideas here… maybe mirroring, black and white, reducing resolution and so on (a rough sketch follows this list)
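A minimal sketch of those simpler augmentations, assuming OpenCV and NumPy; the segmentation-based cropping would plug in as an extra step here:

```python
import cv2
import numpy as np


def augment(frame, rng):
    """Return a randomly augmented copy of a BGR frame. Mirroring is safe
    here because it does not change the vertical stroker position."""
    out = frame.copy()
    if rng.random() < 0.5:  # horizontal mirror
        out = cv2.flip(out, 1)
    if rng.random() < 0.3:  # black and white (kept 3-channel)
        gray = cv2.cvtColor(out, cv2.COLOR_BGR2GRAY)
        out = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    if rng.random() < 0.3:  # reduce resolution, then scale back up
        h, w = out.shape[:2]
        out = cv2.resize(cv2.resize(out, (w // 2, h // 2)), (w, h))
    return out


# usage: augmented = augment(frame, np.random.default_rng(seed=0))
```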

Choice of the model:

  • I am looking into transformer models. There are some papers that showed promising results with few-shot learning (we only have a handful of images, typically around 5, to teach the model a new thing). This approach also has the benefit that it may be possible to train on long sequences of data (e.g. almost the original video), as this style of training is quite good at down-weighting the less important aspects of the data. “It’s good at prioritizing”, so to say. But this all requires experimentation and I’m far from being an expert in machine learning, totally a noob to be honest. This is ambitious, but I’m semi-confident.
  • Then most likely rinse and repeat till there is something halfway decent

So right now I will continue working with the limited data set until the requirements are clearer. Thanks though! Your help will hopefully prove useful in the future.

1 Like

Who the hell scripts with 100ms deviation?
That wouldn’t even pass my QA.
That’s 3 dots in some scenes.

Please read again. This is a crude simplification of how the program determines how trustworthy a pair of video frame and corresponding funscript key is. Furthermore, the value between two keyframes is interpolated. So even if a frame sits between two keys, one 90ms away and the other 70ms away, it will interpolate between both keys and therefore estimate with very high accuracy where the device would be in real time.

As you can imagine, not every scripter brings the same approach or quality to their scripts. Some will include mistakes or inaccurate data. In real life you often don’t really notice, but when I look at an image and the funscript value and see a Grand Canyon of a difference, it’s not suited for training. That’s why I determined 100ms to be accurate enough. This is also looking at it in black and white; in reality it’s a math function that smoothly transitions from good to bad, and the cutoff is not exactly 100ms but around that.

You are also ignoring that there are multiple frames in a segment. A single frame can be inaccurate, no problem; it’s still good enough data. But I also look at the standard deviation, variance and whatever else is interesting to determine whether a segment as a whole is good, such that the majority of segments are good data. With my dataset I ended up with 50k+ images and 10k+ segments. That is enough for training for now and consists of mostly good data.
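For illustration, the segment-level check could look like this, building on the frame scores above; both cutoffs are placeholders of my own, not the actual values:

```python
import statistics


def segment_is_good(frame_scores, mean_cutoff=0.65, stdev_cutoff=0.2):
    """Accept a segment when its frame scores are high on average AND
    consistent: a single bad frame is tolerated, wild variance is not."""
    return (statistics.mean(frame_scores) >= mean_cutoff
            and statistics.pstdev(frame_scores) <= stdev_cutoff)
```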

Garbage in, garbage out. With the approach I took, less than 100ms would leave barely any data, and more would let garbage in. Do you have a better approach? I’d love to include it if so.

It wasn’t a critique of your AI and data collecting, but rather that scripting 100ms off is terrible scripting (excluding streaming delay).

Will this gather the multi-axis data of the actors as well, like from a VR scene?