SCU DSP for matrix transformation?

I tried something to reduce memory usage and the number of processed vertices, but it hasn't worked out too well so far, so hopefully someone can suggest a way to speed up the whole process before I move on to another technique:

As mentioned in this post, I imported Quake maps into my engine for testing, and I subdivided each map into grids and planes (in other words, you have tiles of quads, but the quads in a tile all face the same direction and are all located in the same grid square of the global map coordinates).
This allows me to do aggressive culling, which is needed on the Saturn.

But this subdivision duplicates many vertices, since that's how SGL works (each object contains its own vertices), to the point where it's too much for the Saturn and too much for the default SGL work area.

Long story short, I tried to generate a PDATA (the 3D mesh) on the fly, essentially using lookup tables to determine whether a vertex is already used and replacing the quad's vertex references with the values from that lookup table.

The number of processed vertices is reduced a lot, but the whole process is much slower (a slowdown was expected, but not one this bad).

I guess there might be a way to speed it up by making better use of the CPU cache or by finding a way to make DMA practical (it isn't right now, since no transfer is larger than 12 bytes).

Now, I'm considering reverting to the old technique of just using static PDATA and hoping that, with better hidden surface determination, I can keep everything at manageable levels, but I will still try a few more things before fully ditching the current technique.

Any suggestions on how to speed up the whole memory transfer/lookup?

Code:
void COPY_POINT(POINT source, POINT dest)
{
    dest[X]=source[X];
    dest[Y]=source[Y];
    dest[Z]=source[Z];
}

void ADD_POINT(unsigned short i)
{
    COPY_POINT(VDATA.pntbl[i], LevelMesh.pntbl[LevelMesh.nbPoint]);  //Copies the vertex from the global list to the generated PDATA
    VDATA.WRAM_LUT[i]=LevelMesh.nbPoint;  //The index where the vertex is stored in the generated PDATA
    ++LevelMesh.nbPoint;
}

void COPY_PDATA(unsigned int i)  //It's called for each plane after culling out those not needed
{
    register unsigned int T;
    short *WRAM_LUT=VDATA.WRAM_LUT;  //The lookup table for indexing vertices
    _QDATA * curQuad = QDATA[i];  //The quad data, containing the texture no and the quads' 4 points (pointing to the global vertices list)
    POLYGON * curPol;
    unsigned short * curVert;

    FIXED curNorm[XYZ];
    curNorm[X] = PLANE[i]->norm[X];    curNorm[Y] = PLANE[i]->norm[Y];    curNorm[Z] = PLANE[i]->norm[Z]; //I keep the normals per plane to reduce RAM usage

    for (T=0; T<QDATA[i]->nbPolygon; ++T)
    {
        curVert= curQuad->Vertices[T];
        curPol = &LevelMesh.pltbl[LevelMesh.nbPolygon];

        if (WRAM_LUT[curVert[0]]== -1)            ADD_POINT(curVert[0]);
        if (WRAM_LUT[curVert[1]]== -1)            ADD_POINT(curVert[1]);
        if (WRAM_LUT[curVert[2]]== -1)            ADD_POINT(curVert[2]);
        if (WRAM_LUT[curVert[3]]== -1)            ADD_POINT(curVert[3]);

        curPol->Vertices[0] = WRAM_LUT[curVert[0]];
        curPol->Vertices[1] = WRAM_LUT[curVert[1]];
        curPol->Vertices[2] = WRAM_LUT[curVert[2]];
        curPol->Vertices[3] = WRAM_LUT[curVert[3]];

        curPol->norm[X] = curNorm[X]; curPol->norm[Y] = curNorm[Y]; curPol->norm[Z] = curNorm[Z];
        LevelMesh.attbl[LevelMesh.nbPolygon]=ATTRIBUTE_LIST[QDATA[i]->Texture_ID[T]];
        ++LevelMesh.nbPolygon;
    }
}
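
For anyone who wants to experiment with the lookup-table remapping away from the Saturn, here is a minimal self-contained sketch of the same idea. It is not the engine code above: `point_t`, `mesh_t`, `lut_reset`, `remap_vertex`, `remap_quad` and the array sizes are hypothetical stand-ins for the SGL types, and resetting the table with `memset` is an assumption about how the -1 entries get restored between rebuilds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_MESH_POINTS 64      /* hypothetical capacity for the sketch */
#define LUT_UNUSED      (-1)    /* same "not copied yet" marker as above */

typedef int32_t fixed_t;                          /* stand-in for a 16.16 FIXED */
typedef struct { fixed_t x, y, z; } point_t;      /* stand-in for POINT (12 bytes) */

typedef struct {
    point_t  points[MAX_MESH_POINTS];   /* compacted vertex list */
    uint16_t nb_point;                  /* number of vertices copied so far */
} mesh_t;

/* Mark every global vertex as "not copied yet".
   0xFF in every byte reads back as -1 in each int16_t entry. */
static void lut_reset(int16_t *lut, size_t count)
{
    memset(lut, 0xFF, count * sizeof lut[0]);
}

/* Return the index of global vertex g in the compacted mesh,
   copying it there first if this is the first time it is seen. */
static uint16_t remap_vertex(const point_t *global, int16_t *lut,
                             mesh_t *mesh, uint16_t g)
{
    if (lut[g] == LUT_UNUSED) {
        assert(mesh->nb_point < MAX_MESH_POINTS);
        mesh->points[mesh->nb_point] = global[g];
        lut[g] = (int16_t)mesh->nb_point++;
    }
    return (uint16_t)lut[g];
}

/* Rewrite one quad's four global indices into compacted indices. */
static void remap_quad(const point_t *global, int16_t *lut, mesh_t *mesh,
                       const uint16_t in[4], uint16_t out[4])
{
    for (int k = 0; k < 4; ++k)
        out[k] = remap_vertex(global, lut, mesh, in[k]);
}
```

Compared with the loop above, this reads each table entry once per vertex instead of once for the test and once for the write; whether that matters on the SH2 would have to be measured.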

[Attached images: Z-Treme Quake house 2.png, Z-Treme Quake house 3.png]
 
Sorry to resurrect an old thread, but I just saw something that might be interesting for you, XL2: a YouTube video from Jon Burton (Traveller's Tales), the main programmer behind Sonic R, explaining how he made use of the Saturn's architecture, including the SCU DSP, with even a screenshot of the assembly code he ran on it. Apparently the DSP could execute 6 operations at a time, so he had to write the code in a very peculiar way to keep everything in sync. You can have a look here:



By the way, I am a big fan of your dev work on the Saturn. I am a programmer myself and have always been intrigued by the Saturn's "special" architecture and speculated about ways to make the most of it, so seeing the videos of your project is like a dream come true haha. ;-)
 

Yes, I saw it too.
Traveller's Tales' work is really impressive and I do hope to see a video about the SCU DSP.
It does support 6 parallel instructions, but there is no division unit.
You can do something similar with the SH2, where you calculate other things while waiting for a division to complete, so it's hard to say whether using the SCU DSP is faster than just using the SH2.
The SCU DSP is also clocked at half the speed of the SH2, so that's something else to consider.
I think most people used it for lighting since that doesn't require divisions.
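
To illustrate why lighting fits a unit without a divider, here is a rough sketch in plain C (not SCU DSP code) of a flat-shading intensity computed from a normalized plane normal and a normalized light direction. It only needs multiplies, adds and shifts. The names `fx_mul`, `fx_dot3` and `light_level`, and the 0..31 output range, are assumptions made for the example.

```c
#include <stdint.h>

typedef int32_t fixed_t;                 /* 16.16 fixed point */
#define FX_ONE ((fixed_t)1 << 16)

/* 16.16 multiply through a 64-bit intermediate.
   Assumes >> on a negative value is an arithmetic shift,
   which holds on common compilers but is implementation-defined in C. */
static fixed_t fx_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> 16);
}

/* Dot product of two 3-component vectors: three multiplies, two adds. */
static fixed_t fx_dot3(const fixed_t a[3], const fixed_t b[3])
{
    return fx_mul(a[0], b[0]) + fx_mul(a[1], b[1]) + fx_mul(a[2], b[2]);
}

/* Map the dot product of two unit vectors to a brightness step in 0..31.
   Surfaces facing away from the light clamp to 0. */
static int light_level(const fixed_t normal[3], const fixed_t to_light[3])
{
    fixed_t d = fx_dot3(normal, to_light);
    if (d < 0)      d = 0;
    if (d > FX_ONE) d = FX_ONE;
    return (int)((d * 31) >> 16);
}
```

Since the normals are already stored per plane, one such dot product could cover every quad on that plane.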
 