User Models and Profiles (building)

8 minute read


  1. User Information Collection
    1.1 Explicit user information collection
    1.2 Implicit user information collection
    1.3 Techniques for implicit user information collection
  2. Step 1. User identification
  3. Step 2. User Model Construction
    3.1 Building keyword-based user model
    3.2 Building graph-based user model
    3.3 Building concept-based user model
  4. Summary

User Information Collection

Explicit user information collection

  • Information entered by the user, e.g. via HTML forms (self-report or self-assessment)
  • Data may contain:
    • Demographics such as birthday, marital status, hobbies, personal status
  • Explicit feedback
    • The user rates some links on a page (e.g. Syskill & Webert), and the system recommends other links they might be interested in
Pros

  • More reliable information
  • Easy to process
  • Complies with privacy regulations
  • The user is in control of what information they give out

Cons

  • Users have to voluntarily provide information; otherwise no profile can be built
  • Requires time and willingness to contribute
  • Places an additional burden on the user and can take long, which can lower the user experience
  • Users may be confused or biased in the information they give
  • There may be relevant information outside of what the user knows about themselves
  • Dynamic changes can be missed – people change, so you have to ask repeatedly or the profile becomes inaccurate over time
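In code, explicit collection boils down to accepting only what the user chose to submit. A minimal sketch of processing a self-report form, where the field names are illustrative assumptions, not a fixed schema:

```python
# Hypothetical self-report form fields; names are illustrative only
PROFILE_FIELDS = {"birthday", "marital_status", "hobbies"}

def build_explicit_profile(form_data):
    """Keep only the expected fields and drop empty answers, so the
    profile contains exactly what the user chose to give out."""
    return {key: value for key, value in form_data.items()
            if key in PROFILE_FIELDS and value}
```

Filtering against a whitelist of fields is one way of keeping the user in control: anything outside the declared form fields is never stored.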

Implicit user information collection

  • Constructed from implicitly collected information – implicit user feedback
  • Collected by the system
    • On the user’s client machine or on the application server
  • Uses digital traces of user interaction – comments we write, news we read, items we click, videos we watch
  • May add additional information about the user’s device
  • The user is not explicitly aware that information is being collected
Pros

  • Unobtrusive – does not require any additional intervention by the user
  • Doesn’t require new software to be developed and installed
  • Rich data about the user
  • Gathers information quickly

Cons

  • Not all personalised sites are used frequently enough by any single user to allow a useful profile to be created

Techniques for implicit user information collection

Browser cache

  • Information collected: browsing history
  • Information breadth: any websites
  • Pros:
    • Unobtrusive – the user doesn’t have to install anything
    • Captures history
    • Objective/authentic
  • Cons:
    • Privacy invasion
    • Noisy (more than one user may be using the device)

Interaction/desktop agents

  • Information collected: interaction and user activity – all the steps within the app
  • Information breadth: any personalised application
  • Pros:
    • All user files and activity available – the options they click on, the resources they open
  • Cons:
    • Requires the user to install software
    • Investment in the development of the software

Logs (web logs, search logs)

  • Information collected: browsing/search activity – pages the user clicked, keywords
  • Information breadth: websites/search engine sites that are logged
  • Pros:
    • If you can link the words’ meanings to concepts, it can be quite clever – e.g. Google’s knowledge graph (airport → travel)
    • Information about multiple users is collected
    • Collection and use of the information happen at the same time
  • Cons:
    • May be very little information, as it comes from one site
    • Cookies must be turned on and/or the user must log in to the site

File transfer from one app to another

  • Information collected: previously stored information, e.g. a wishlist on YouTube/Netflix
  • Information breadth: application/organisation specific
  • Cons:
    • Has to be something that already exists, so there may not be much

Mobile/wearable sensors

  • Information collected: contextual data such as GPS; physiological and psychological states
  • Information breadth: anywhere/anytime the user has the devices on
  • Pros:
    • Varied information and real-time data
  • Cons:
    • The user has to have the device on

Emerging – speech/comments on social media

  • Information collected: sentiments, viewpoints, interests
  • Information breadth: social media platforms
  • Pros:
    • Text, pitch, hesitations, pace, together with the language itself, can be powerful
    • Language gives an authoritative information/indication
    • Natural communication
    • Fairly robust technique – there are libraries that allow speech to be processed reliably
  • Cons:
    • Privacy and security – e.g. dictated emails now reside somewhere on a server
    • Can disadvantage some people with a handicap – need to offer an alternative channel
    • Some languages may not be well resourced
    • Restricted by the environment – noise
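As a toy illustration of the log-based technique, a sketch that turns a hypothetical tab-separated search log (`user_id<TAB>query`) into per-user keyword counts – the log format is an assumption for illustration:

```python
from collections import Counter

def keywords_from_search_log(log_lines):
    """Extract per-user query keyword counts from a (hypothetical)
    tab-separated search log: user_id<TAB>query."""
    profiles = {}
    for line in log_lines:
        user_id, query = line.rstrip("\n").split("\t", 1)
        # Accumulate lowercased query terms per user
        profiles.setdefault(user_id, Counter()).update(query.lower().split())
    return profiles
```

Note how the log yields information about multiple users at once – one pass produces a keyword profile per user ID.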

Step 1. User identification

  • Once you decide on the information collection method: who is the user? How do we identify the user we are building the model for?
  • Crucial for any system that constructs profiles representing individual users

  • Methods for user identification
    • Software agents
      • Small programs that reside on the user’s computer
      • Collect information about the user and share it with a server via some protocol
      • + Most reliable – more control over the implementation and the protocol used for identification
      • - Requires user participation to install the software
    • Logins
      • + Better accuracy and consistency – tracks across sessions and between computers
      • + Can access information from different computers
      • + Knows who the user is and can control who they are
      • + Done with the user’s consent
      • + Second most reliable
      • - The user must create an account via registration, then log in and log out – a burden on the user
    • Cookies

      • + Easiest and most widely deployed – transparent to the user
      • - Poor accuracy when multiple users share a machine – it then becomes a privacy violation
      • - If the user uses more than one computer, separate user profiles are created
      • - If the user clears cookies, the profile is reset
    • Session IDs

      • Activity during the visit is tracked
      • + All browsers support it
      • + Good for searches – look at the session for a short time and start recommending (adapting)
      • + Doesn’t violate privacy – no need to record anything, because you are only looking at the current session
      • - Not a long-term user model
    • Enhanced proxy servers
      • - Require users to register their machines with the proxy server
      • - Generally only able to identify users connecting from one location, unless users bother to register different computers with the same proxy
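Cookie-based identification can be sketched in a few lines; the cookie name and identifier format here are illustrative assumptions, and the cookie jar stands in for the browser’s store:

```python
import secrets

SESSION_COOKIE = "uid"  # hypothetical cookie name

def identify_user(cookies):
    """Cookie-based identification: reuse the stored identifier if
    present, otherwise mint a new one (a 'new user' as far as the
    server can tell). Returns (user_id, newly_created)."""
    if SESSION_COOKIE in cookies:
        return cookies[SESSION_COOKIE], False
    new_id = secrets.token_hex(16)
    cookies[SESSION_COOKIE] = new_id  # in HTTP this goes out via Set-Cookie
    return new_id, True
```

The weaknesses listed above fall straight out of this sketch: clearing the jar resets the identity, and a second browser or machine gets a second `uid`.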

Step 2. User model construction

  • The next step is constructing the user model – we need to think about the techniques we are going to use
  1. Take input information about the user – this requires data mining skills.
  2. Take into account not only what comes in, but also how the user model will be represented – which part of the input is related to the user model
    • E.g. when modelling emotional state – which of the captured information is related to this?
    • What is the model, what comes in, and which of the information will give me the final model?
  3. Conduct appropriate processing – take the information and derive the processing needed to come up with a model
    • If the model is binary, e.g. whether the user is active or inactive – this could become a classifier
    • If you are looking for several parameters – you might need other processing
      • One way is to overlay the user model – an aggregating user model – looking at frequencies or inferences
  4. Extract the user model! This is the final outcome
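As a minimal sketch of step 3, assuming a binary active/inactive model and a hypothetical threshold of three events per week:

```python
def classify_activity(events_per_week, threshold=3):
    """Binary user model: 'active' if the user generated at least
    `threshold` events per week on average, else 'inactive'.
    The threshold is an illustrative assumption."""
    avg = sum(events_per_week) / len(events_per_week)
    return "active" if avg >= threshold else "inactive"
```

Even this trivial rule follows the four steps above: input (event counts), a chosen representation (one binary label), processing (averaging and thresholding), and the extracted model (the label).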

Building keyword-based user model

  1. Initially created by extracting keywords from web pages collected from some information source (e.g. browsing history) – the user is browsing the web
  2. Through a browsing agent: if the user clicks on a document, pull the document
  3. From these positive feedback documents, look at the text they contain, which is what the user has possibly read
    • A positive feedback document represents the user’s interests
  4. From these documents, extract keywords and weight them using TF*IDF (term frequency–inverse document frequency)

Input and output

  • Input – the unpacked documents, i.e. what the user has read
    • What we want – a list of keywords k1, k2, k3, …

Steps

  1. TF – unpack the documents, then count the frequency of each word in each document
    • Some terms will only appear in specialised documents – these are more important!
  2. IDF – inverse document frequency – for each term, count how many of the documents contain it
    • This tells us the weight of the term in the document space!
  3. TF*IDF – multiply the two to find the TF*IDF weight
    • Title and heading words are identified and weighted more highly. The terms with the highest TF*IDF are the core terms, and we need smart ways of aggregating these core terms, e.g. based on similarities or on overlap of the language – then there’s a user model!
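The three steps above can be sketched in a few lines of Python – a toy version that tokenises on whitespace and skips the title/heading boosting:

```python
import math
from collections import Counter

def tfidf_profile(documents, top_k=5):
    """Build a keyword-based user profile from positive-feedback
    documents: weight each term with TF*IDF and keep the
    highest-scoring terms across all documents."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # IDF input: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    # TF*IDF per term, aggregated over all documents
    scores = Counter()
    for tokens in tokenised:
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += (count / len(tokens)) * idf

    return [term for term, _ in scores.most_common(top_k)]
```

Note that a term appearing in every document gets IDF = log(1) = 0, so ubiquitous words such as “the” drop out of the profile automatically – exactly the behaviour the IDF step is there for.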

Building graph-based user model

  • Built by collecting explicit positive and negative feedback from users

  • Input: a graph

  • What are the user’s interests in the documents that we pulled? These documents are the POSITIVE examples of user interests
    • We rely on having a reliable enough method to identify that a document is positive, e.g. how long the user stayed on it, whether they shared it, etc.
  • Reminder: entities are the nodes, and we have relationships between the nodes. We want to extract this graph

  • Graph overlay – overlay the entities that the user is interested in – this is the output
    • From the document, you need to look for concepts from the graph – the world knowledge is usually given, so rather than counting terms, here we look for concepts that are part of this graph, which needs a different approach!
  • The approach we use is semantic tagging
    • There are libraries and tools for this
    • Take the world knowledge from the world model, map it, go through the textual documents, and identify which of the annotated tags/concepts are mentioned in each document
  • Once semantic tagging is done, count in each document how often each particular concept has appeared
  • Then you can decide on the overlay. This is the graph-based profile
  • First you need the graph as an input, and the smartness is in how you do the tagging
    • How? Look through the text, cut it into words or phrases (unigrams, bigrams), and map these to the graph. The text may not be exactly what is in the graph, so we do approximate tagging (similarities, synonyms, partial overlays)
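A minimal sketch of semantic tagging plus the graph overlay, assuming a tiny hand-made concept lexicon (concept → surface forms, including synonyms) standing in for the world-knowledge graph; real systems use much richer approximate matching:

```python
from collections import Counter

# Hypothetical world-knowledge fragment: concept -> surface forms/synonyms
CONCEPT_LEXICON = {
    "travel": {"travel", "trip", "journey", "flight"},
    "airport": {"airport", "terminal"},
    "food": {"food", "restaurant", "meal"},
}

def tag_concepts(document):
    """Approximate semantic tagging: map each unigram in the document
    to a graph concept via its known surface forms."""
    tags = Counter()
    for word in document.lower().split():
        for concept, surface_forms in CONCEPT_LEXICON.items():
            if word in surface_forms:
                tags[concept] += 1
    return tags

def graph_overlay(documents):
    """Overlay: aggregate concept counts over all positive documents."""
    overlay = Counter()
    for doc in documents:
        overlay.update(tag_concepts(doc))
    return overlay
```

The overlay is the graph-based profile: instead of counting terms, we count how often each graph concept is mentioned, so “flight” and “trip” both feed the same travel node.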

Building concept-based user model

  • Nodes represent abstract topics considered interesting to the user, rather than specific words or sets of words

First method


  • We take each document and do semantic tagging to get the overlay
  • From then on, you need to come up with an aggregated list of concepts – look for the top concepts. The overlay may cover parts of the graph that are sparse or very large.
    • You might need to do pre-processing on the graph
      • E.g. use common categories or the most frequent concepts to come up with a list, basing the counting on the graph
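A sketch of the aggregation step, assuming a hypothetical concept-to-category mapping standing in for the pre-processed graph:

```python
from collections import Counter

# Hypothetical mapping from fine-grained concepts to broader categories
PARENT_CATEGORY = {
    "heathrow": "airports",
    "gatwick": "airports",
    "sushi": "food",
    "pizza": "food",
}

def top_categories(concept_counts_per_doc, top_k=2):
    """Roll per-document concept counts up to common parent categories
    and keep the most frequent ones as the concept-based user model."""
    totals = Counter()
    for counts in concept_counts_per_doc:
        for concept, n in counts.items():
            # Fall back to the concept itself if it has no parent
            totals[PARENT_CATEGORY.get(concept, concept)] += n
    return [category for category, _ in totals.most_common(top_k)]
```

Rolling specific concepts up to shared categories is what turns a sparse overlay into a compact list of abstract topics.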

Second method


  • Identify the positive documents, then based on those you need to identify what the common things in these documents are

    • You can cluster the documents – group the most similar documents together – then extract topics for each of the clusters and come up with the top concepts as your user model
    • The user modelling component is the red part of the figure
    • But the input, the positive examples, needs to be reliable!
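The clustering route can be sketched with a greedy word-overlap grouping (Jaccard similarity; the threshold is an illustrative assumption) followed by frequency-based topic extraction:

```python
from collections import Counter

def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    return len(a & b) / len(a | b)

def cluster_documents(documents, threshold=0.2):
    """Greedy clustering: add a document to the first cluster whose
    seed document is similar enough, otherwise start a new cluster."""
    token_sets = [set(doc.lower().split()) for doc in documents]
    clusters = []  # each cluster is a list of document indices
    for i, tokens in enumerate(token_sets):
        for cluster in clusters:
            if jaccard(tokens, token_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def cluster_topics(documents, clusters, top_k=1):
    """Take the most frequent words in each cluster as its topics."""
    topics = []
    for cluster in clusters:
        counts = Counter(word for i in cluster
                         for word in documents[i].lower().split())
        topics.append([word for word, _ in counts.most_common(top_k)])
    return topics
```

A production system would use a proper clustering algorithm (e.g. k-means over TF*IDF vectors), but the shape is the same: group similar positive documents, then label each group with its dominant concepts.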

Summary

  • User Information Collection
    • Explicit: given by the user
    • Implicit: monitoring what the user is doing, collected by the system
  • If we do implicit information collection:
    • Step 1: Identify the user
      • Depends on the data collection
    • Step 2: Construct the model
      • Keyword-based
      • Graph-based
      • Concept-based
      • We need to think about how the model is represented and what the input data is.
