Kolibri Performance Issue

We are experiencing performance issues with Kolibri. We are on v0.17 and we are using PostgreSQL.
The application server has 4 cores and 16 GB RAM; the PostgreSQL server has 2 cores and 8 GB RAM.

Hi @safi_khan,

Could you describe in a bit more detail what kind of performance issues you are seeing?

What is happening for end users when this happens?

How many simultaneous users are accessing Kolibri when these issues happen?

What is the CPU, memory, network, and disk usage during these times of performance load?

Please let me know any other details you can provide!

Kind Regards,
Richard

Hi Richard,

There are a lot of simultaneous users (I am not sure how to find the number of simultaneous users; can you help me determine this?)
The server’s CPU, memory, and so on all look normal and well within range at peak times.
At peak times, the Kolibri admin panel freezes and will not load the “Coach” tab at all, and on the client side, courses either will not load or take too long to load.
We host it on AWS; here is a screenshot of the system


(This is the last 3 days)

Hi @safi_khan,

There’s no built-in way to determine this; I assumed you might know from who was accessing your Kolibri server. I may also not be totally in the loop about your deployment model: is your hardware being accessed locally, or across the internet?

A quick way to look at this might be to download the Content Session Logs for a particular time frame, and see how many simultaneous content sessions you are seeing in there.
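If it helps, the peak-concurrency count from that export could be scripted rather than eyeballed. This is a minimal sketch; the timestamp column names are assumptions about the CSV header, so check them against the actual export before relying on it:

```python
import csv
from datetime import datetime

def max_concurrent(intervals):
    """Given (start, end) datetime pairs, return the peak number of
    overlapping intervals, using a simple sweep-line event count."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a session opens
        events.append((end, -1))    # a session closes
    # Sort by time; process closes before opens at the same instant,
    # so back-to-back sessions don't count as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

def sessions_from_csv(path):
    """Yield (start, end) pairs from a content session log export.
    'start_timestamp' and 'end_timestamp' are assumed column names."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield (
                datetime.fromisoformat(row["start_timestamp"]),
                datetime.fromisoformat(row["end_timestamp"]),
            )

# Example usage:
# peak = max_concurrent(sessions_from_csv("content_session_logs.csv"))
```

The sweep-line count is independent of the CSV details, so only `sessions_from_csv` would need adjusting if the headers differ.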

It does seem like CPU is maxed out at 25%, which for a 4-core machine suggests that only a single core is being used.

I would recommend using our kolibri-server package, which provides a higher-performance setup out of the box that should fully utilize the available CPU, using uwsgi and nginx as a reverse proxy.

https://kolibri.readthedocs.io/en/latest/install/ubuntu-debian.html#higher-performance-with-the-kolibri-server-package

Kind Regards,
Richard

We host our Kolibri server in an AWS environment, with a PostgreSQL RDS instance and an EC2 instance hosting the Kolibri application. We have about 2k - 5k users accessing Kolibri. Let me try installing kolibri-server and see if that improves performance.

@richard Is there any way to use the Python installation to install kolibri-server instead of the Debian/Ubuntu installation? I am trying to install it on a RHEL-based Linux distribution.

Ah, no - sorry, we don’t have a non-deb installer for this, as it is mostly nginx and uwsgi orchestration.

In that case, I would recommend you look at the source for the package (GitHub - learningequality/kolibri-server: A performance-boosting access layer for Kolibri with multi-core support and improved caching), which contains the template nginx and uwsgi configurations. You should be able to set these up independently on a RHEL installation, and then configure them appropriately using this as a guide.
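For orientation, a minimal sketch of the shape of such a setup is below. Every path, port, and process count here is an assumption for illustration only; the templates in the kolibri-server repository are the authoritative reference:

```nginx
# /etc/nginx/conf.d/kolibri.conf -- rough sketch, not the packaged config
server {
    listen 8080;

    # Serve static assets directly from disk rather than through Python
    # (the actual static path depends on your KOLIBRI_HOME).
    location /static/ {
        alias /var/lib/kolibri/static/;
    }

    # Everything else goes to the uWSGI workers running Kolibri,
    # e.g. started with roughly: uwsgi --socket 127.0.0.1:8081 --processes 4 ...
    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:8081;
    }
}
```

The multi-core benefit comes from running several uwsgi worker processes behind nginx, rather than Kolibri's single built-in server process.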

@richard Just wanted to give you an update: we switched Kolibri over to kolibri-server and are monitoring the system now.

The classes report tab still takes a long time to load the page, is there something we can do about that? This is our options.ini, maybe we can tweak the system somehow?


It took 500 seconds to load the reports tab

Also, we have 42k users in our system (not all are active); is there any way we can find the inactive users and delete them? (We are using a PostgreSQL DB.)

For the reports tab, how many users are in a single class?

For deleting inactive users, we don’t have a great workflow for identifying them, but you might be able to use a combination of features to determine this.

In the facility data tab, you can export a CSV of ContentSessionLogs filtered by time - so if you only want to retain users who have been active in the past ~6 months, you could export the logs for the past 6 months. Each row in that export CSV will have the associated user id - in the same data tab, you can export a CSV of all the user accounts. With some spreadsheet lookups, you could probably then filter that CSV of all user accounts down to only those that were included in your Content Session Log export.
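The spreadsheet-lookup step could equally be scripted. Here is a minimal sketch, assuming the session log export has a `user_id` column and the user export has an `id` column; check the real headers of both CSVs before using anything like this:

```python
import csv

def active_user_ids(session_log_csv, user_id_column="user_id"):
    """Collect the ids of every user who appears in the content
    session log export for the chosen retention window."""
    with open(session_log_csv, newline="") as f:
        return {row[user_id_column] for row in csv.DictReader(f)}

def filter_users(users_csv, output_csv, keep_ids, id_column="id"):
    """Write a copy of the users CSV containing only the users in
    keep_ids, preserving all columns, ready for re-import (which
    deletes the users that were removed from the CSV)."""
    with open(users_csv, newline="") as fin, \
         open(output_csv, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[id_column] in keep_ids:
                writer.writerow(row)

# Example usage:
# keep = active_user_ids("content_session_logs_6_months.csv")
# filter_users("all_users.csv", "active_users.csv", keep)
```

As with the spreadsheet approach, it is worth spot-checking the filtered CSV before re-importing it, since the import will delete any user not present in the file.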

If you then reupload the users CSV using the import users functionality in the same Facility data tab, then it will delete the users you have removed from the CSV.

This is a little bit roundabout, but feels like the easiest workflow without diving into the Python shell. It would definitely be possible using the latter, and if you are familiar and comfortable with Python and the Django ORM especially, then you could do it that way instead.

We have multiple classes (about 20), and each class has between 350 and 1500 users. We create the same lessons for each of these classes using multiple channels on Kolibri.

Were you able to look into the options.ini file screenshot that I sent earlier?

I did look at the screenshot - I very much doubt any tweaks there would make any impact on the responsiveness of the coach reports page. The page loads a summary of all of the learning data per class. This is generally designed for an in-person class size of 30-50 students, for which we have been able to achieve reasonable query performance; at the class sizes you are describing, I can imagine the required queries being very slow.

We would need to make updates to Kolibri to either paginate or otherwise segment the queries in order to improve the query performance here, but that would take a considerable amount of work, as the current architecture of the coach reports assumes availability of all this data all at once.

So, to accommodate the current load, can you please suggest something? Can we scale Kolibri vertically? We have a load balancer with EC2 instances; can we run multiple Kolibri instances that connect to the same database (is this possible)?

Also, after installing kolibri-server, I still don’t see the CPU utilization going any higher than before, so I am not sure whether kolibri-server is actually using multiple cores.

My guess is that the issue has very little to do with concurrent load on the server - I would imagine that even if you attempt to access the coach reports when there is no load on the server, it will take a very long time to load, simply because of the amount of data that is being queried at once on a per class basis.

Is this the case? Are you seeing any performance issues for end users accessing Kolibri, or only in attempting to monitor activity from the coach reports?

Both. The users are complaining about long load times when they try to access their class, and sometimes the webpage crashes when they try to access the “Library” tab after logging in.

On the admin side, when we try to access the reports tab or open the classes (for any modification), it takes a very long time.

Are you able to inspect the database at all to see what its resource utilization looks like? There may also be tooling to look at long-running queries, which would be useful to pin down where the bottlenecks are. I am not familiar enough with AWS to know where that tooling would be, though.

Yes, we looked at Performance Insights to see if the DB was under load, but the DB looks healthy and CPU is <20%, though some of the queries are taking longer.

Hrm, none of the queries seem to be taking overly long from that list.

When you are attempting to load the coach reports page, if you open the Network tab in the browser developer tools, are any of the requests taking particularly long, consistently?

Likewise for loading the library page for learners under high load?

I am still a little confused why CPU utilization might not be maxing out - are you able to look at disk I/O to see if that might be causing a bottleneck somewhere?

We looked at some of the resource utilization on the Kolibri server, and it seems like the server is using too much memory (about 80% of it). I want to try adding an external Redis server to Kolibri to see if we get any performance improvement. I tried uncommenting the Redis location and adding the Redis endpoint, but for some reason Kolibri won’t start at all.

This is my attempt to add cache server

[Cache]
CACHE_BACKEND = redis
CACHE_REDIS_MAXMEMORY_POLICY = allkeys-lru
CACHE_REDIS_MAXMEMORY = 100048077
# Which backend to use for the main cache - if 'memory' is selected, then for most cache operations,
# an in-memory, process-local cache will be used, but a disk based cache will be used for some data
# that needs to be persistent across processes. If 'redis' is used, it is used for all caches.
# CACHE_BACKEND = memory

# Default timeout for entries put into the cache.
# CACHE_TIMEOUT = 300

# Maximum number of entries to maintain in the cache at once.
# CACHE_MAX_ENTRIES = 1000

# Password to authenticate to Redis, Redis only.
# CACHE_PASSWORD =

# Host and port at which to connect to Redis, Redis only.
CACHE_LOCATION = redis-kolibri-server:6379

# The database number for Redis.
# CACHE_REDIS_DB = 0

# Maximum number of simultaneous connections to allow to Redis, Redis only.
# CACHE_REDIS_MAX_POOL_SIZE = 50

# How long to wait when trying to connect to Redis before timing out, Redis only.
# CACHE_REDIS_POOL_TIMEOUT = 30

# Maximum memory that Redis should use, Redis only.
# CACHE_REDIS_MAXMEMORY = 0

# Eviction policy to use when using Redis for caching, Redis only.
# CACHE_REDIS_MAXMEMORY_POLICY =

Also, since our DB (PostgreSQL) is external, I tried setting up autoscaling (multiple EC2 instances running the kolibri-server application with the same options.ini configuration), but when I tried to log in after the setup, I was redirected to the login screen every time I logged in.

Hrm - if Kolibri is not starting after these settings are applied, my only real guess would be that it’s not finding the Redis server at the configured location, or is trying to connect and is unable to do so.
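One quick way to rule out basic reachability, independently of Kolibri, is a plain TCP check against the configured host and port (the hostname below is just the one from your pasted config):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.
    This only proves the port is reachable from this machine; it
    does not prove Redis is healthy or that auth is correct."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: can_connect("redis-kolibri-server", 6379)
```

If the TCP check passes, `redis-cli -h <host> -p 6379 ping` from the Kolibri instance would be the next sanity check; if it fails, the problem is DNS, security groups, or the Redis service itself rather than Kolibri's cache settings.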

For the multiple instances, this is probably a result of Kolibri using a file backed session store by default.

You can see this in the default Kolibri settings here: kolibri/kolibri/deployment/default/settings/base.py at develop · learningequality/kolibri · GitHub

We don’t currently have a way in our options.ini to override this, so you would need to create a custom django settings file (which can import from the Kolibri default settings and then override this).

from kolibri.deployment.default.settings.base import *

SESSION_ENGINE = ...

You can find other options for the session engine in the Django documentation (“How to use sessions”). I can’t give a recommendation, as I haven’t tried any of them out, but we use database-backed sessions on Kolibri Studio.

The reason this happens is that, with a file-backed session store in a scaling scenario, the instance that processes your login may not be the instance that processes your next request. If you ever send a request to a different instance, it will invalidate your session when it doesn’t find it in its on-disk session store, so you end up with the behaviour you are seeing.
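Putting the pieces together, a custom settings module might look like the sketch below. The database-backed engine shown is only the one mentioned above as used on Kolibri Studio, not a tested recommendation, and how you point Kolibri at a custom settings module (e.g. via the standard Django `DJANGO_SETTINGS_MODULE` environment variable) should be checked against the Kolibri docs:

```python
# my_settings.py -- sketch of a custom Django settings module for Kolibri

# Pull in all of Kolibri's default settings first.
from kolibri.deployment.default.settings.base import *  # noqa: F401,F403

# Override the session store so that every instance behind the load
# balancer sees the same sessions (database-backed, as on Studio).
SESSION_ENGINE = "django.contrib.sessions.backends.db"
```

Any shared session backend (database, cache, etc.) resolves the redirect loop, because the session no longer lives on one instance's local disk.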

We are still running into issues where users are not able to access the content (sometimes Kolibri crashes too). The same goes for the admins when they try to make any changes in the Coach tab or try to access the reports (the reports take too long to load).

Can you please recommend an architecture based on our load?
Classes - 80 classes
Avg users per class - 400 - 600
Channels - 20 channels

Each class gets the same channel resources copied.

Current Architecture -
EC2 instances running kolibri-server, connecting to RDS (PostgreSQL), behind a load balancer.
If we use EFS (a network shared drive), can we autoscale then?