User Models and Profiles (building)

8 minute read


  1. User Information Collection
    1.1 Explicit user information collection
    1.2 Implicit user information collection
    1.3 Techniques for implicit user information collection
  2. Step 1. User identification
  3. Step 2. User Model Construction
    3.1 Building keyword-based user model
    3.2 Building graph-based user model
    3.3 Building concept-based user model
  4. Summary

User Information Collection

Explicit user information collection

  • Information entered by the user, e.g. via HTML forms (self-report or self-assessment)
  • Data may contain:
    • Demographics such as birthday, marital status, hobbies, personal status
  • Explicit feedback
    • The user rates some links on a page (e.g. Syskill & Webert), and the system recommends other links they might be interested in
Pros

  • More reliable information
  • Easy to process
  • Complies with privacy regulations
  • The user is in control of what information they give out

Cons

  • Users have to voluntarily provide information; otherwise no profile can be built
  • Requires time and willingness to contribute
  • Places an additional burden on the user and can take long, which can lower the user experience
  • Users may be confused or biased in the information they give
  • There may be relevant information outside of what the user knows about themselves
  • Dynamic changes can be missed – people change, so you have to ask repeatedly or the profile becomes inaccurate over time
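In code, explicit collection boils down to accepting only what the user chose to submit. A minimal sketch of processing a self-report form, where the field names are illustrative assumptions, not a fixed schema:

```python
# Hypothetical self-report form fields; names are illustrative only
PROFILE_FIELDS = {"birthday", "marital_status", "hobbies"}

def build_explicit_profile(form_data):
    """Keep only the expected fields and drop empty answers, so the
    profile contains exactly what the user chose to give out."""
    return {key: value for key, value in form_data.items()
            if key in PROFILE_FIELDS and value}
```

Filtering against a whitelist of fields is one way of keeping the user in control: anything outside the declared form fields is never stored.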

Implicit user information collection

  • Constructed from implicitly collected information – implicit user feedback
  • Collected by the system
    • On the user’s client machine or on the application server
  • Uses digital traces of user interaction – comments we write, news we read, items we click, videos we watch
  • May add additional information about the user’s device
  • The user is not explicitly aware that information is being collected
Pros

  • Unobtrusive – does not require any additional intervention by the user
  • Doesn’t require new software to be developed and installed
  • Rich data about the user
  • Gathers information quickly

Cons

  • Not all personalised sites are used frequently enough by any single user to allow a useful profile to be created

Techniques for implicit user information collection

Browser cache

  • Information collected: browsing history
  • Information breadth: any websites
  • Pros:
    • Unobtrusive – the user doesn’t have to install anything
    • Captures history
    • Objective/authentic
  • Cons:
    • Privacy invasion
    • Noisy (more than one user may be using the device)

Interaction/desktop agents

  • Information collected: interaction and user activity – all the steps within the app
  • Information breadth: any personalised application
  • Pros:
    • All user files and activity available – the options they click on, the resources they open
  • Cons:
    • Requires the user to install software
    • Investment in the development of the software

Logs (web logs, search logs)

  • Information collected: browsing/search activity – pages the user clicked, keywords
  • Information breadth: websites/search engine sites that are logged
  • Pros:
    • If you can link the words’ meanings to concepts, it can be quite clever – e.g. Google’s knowledge graph (airport → travel)
    • Information about multiple users is collected
    • Collection and use of the information happen at the same time
  • Cons:
    • May be very little information, as it comes from one site
    • Cookies must be turned on and/or the user must log in to the site

File transfer from one app to another

  • Information collected: previously stored information, e.g. a wishlist on YouTube/Netflix
  • Information breadth: application/organisation specific
  • Cons:
    • Has to be something that already exists, so there may not be much

Mobile/wearable sensors

  • Information collected: contextual data such as GPS; physiological and psychological states
  • Information breadth: anywhere/anytime the user has the devices on
  • Pros:
    • Varied information and real-time data
  • Cons:
    • The user has to have the device on

Emerging – speech/comments on social media

  • Information collected: sentiments, viewpoints, interests
  • Information breadth: social media platforms
  • Pros:
    • Text, pitch, hesitations, pace, together with the language itself, can be powerful
    • Language gives an authoritative information/indication
    • Natural communication
    • Fairly robust technique – there are libraries that allow speech to be processed reliably
  • Cons:
    • Privacy and security – e.g. dictated emails now reside somewhere on a server
    • Can disadvantage some people with a handicap – need to offer an alternative channel
    • Some languages may not be well resourced
    • Restricted by the environment – noise
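As a toy illustration of the log-based technique, a sketch that turns a hypothetical tab-separated search log (`user_id<TAB>query`) into per-user keyword counts – the log format is an assumption for illustration:

```python
from collections import Counter

def keywords_from_search_log(log_lines):
    """Extract per-user query keyword counts from a (hypothetical)
    tab-separated search log: user_id<TAB>query."""
    profiles = {}
    for line in log_lines:
        user_id, query = line.rstrip("\n").split("\t", 1)
        # Accumulate lowercased query terms per user
        profiles.setdefault(user_id, Counter()).update(query.lower().split())
    return profiles
```

Note how the log yields information about multiple users at once – one pass produces a keyword profile per user ID.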

Step 1. User identification

  • Once you decide on the information collection method: who is the user? How do we identify the user we are building the model for?
  • Crucial for any system that constructs profiles representing individual users

  • Methods for user identification
    • Software agents
      • Small programs that reside on the user’s computer
      • Collect information about the user and share it with a server via some protocol
      • + Most reliable – more control over the implementation and the protocol used for identification
      • - Requires user participation to install the software
    • Logins
      • + Better accuracy and consistency – tracks across sessions and between computers
      • + Can access information from different computers
      • + Knows who the user is and can control who they are
      • + Done with the user’s consent
      • + Second most reliable
      • - The user must create an account via registration, then log in and log out – a burden on the user
    • Cookies

      • + Easiest and most widely deployed – transparent to the user
      • - Poor accuracy when multiple users share a machine – it then becomes a privacy violation
      • - If the user uses more than one computer, separate user profiles are created
      • - If the user clears cookies, the profile is reset
    • Session IDs

      • Activity during the visit is tracked
      • + All browsers support it
      • + Good for searches – look at the session for a short time and start recommending (adapting)
      • + Doesn’t violate privacy – no need to record anything, because you are only looking at the current session
      • - Not a long-term user model
    • Enhanced proxy servers
      • - Require users to register their machines with the proxy server
      • - Generally only able to identify users connecting from one location, unless users bother to register different computers with the same proxy
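Cookie-based identification can be sketched in a few lines; the cookie name and identifier format here are illustrative assumptions, and the cookie jar stands in for the browser’s store:

```python
import secrets

SESSION_COOKIE = "uid"  # hypothetical cookie name

def identify_user(cookies):
    """Cookie-based identification: reuse the stored identifier if
    present, otherwise mint a new one (a 'new user' as far as the
    server can tell). Returns (user_id, newly_created)."""
    if SESSION_COOKIE in cookies:
        return cookies[SESSION_COOKIE], False
    new_id = secrets.token_hex(16)
    cookies[SESSION_COOKIE] = new_id  # in HTTP this goes out via Set-Cookie
    return new_id, True
```

The weaknesses listed above fall straight out of this sketch: clearing the jar resets the identity, and a second browser or machine gets a second `uid`.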

Step 2. User model construction

  • The next step is constructing the user model – we need to think about the techniques we are going to use
  1. Take input information about the user – this requires data mining skills.
  2. Take into account not only what comes in, but also how the user model will be represented – which part of the input is related to the user model
    • E.g. when modelling emotional state – which of the captured information is related to this?
    • What is the model, what comes in, and which of the information will give me the final model?
  3. Conduct appropriate processing – take the information and derive the processing needed to come up with a model
    • If the model is binary, e.g. whether the user is active or inactive – this could become a classifier
    • If you are looking for several parameters – you might need other processing
      • One way is to overlay the user model – an aggregating user model – looking at frequencies or inferences
  4. Extract the user model! This is the final outcome
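As a minimal sketch of step 3, assuming a binary active/inactive model and a hypothetical threshold of three events per week:

```python
def classify_activity(events_per_week, threshold=3):
    """Binary user model: 'active' if the user generated at least
    `threshold` events per week on average, else 'inactive'.
    The threshold is an illustrative assumption."""
    avg = sum(events_per_week) / len(events_per_week)
    return "active" if avg >= threshold else "inactive"
```

Even this trivial rule follows the four steps above: input (event counts), a chosen representation (one binary label), processing (averaging and thresholding), and the extracted model (the label).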

Building keyword-based user model

  1. Initially created by extracting keywords from web pages collected from some information source (e.g. browsing history) – the user is browsing the web
  2. Through a browsing agent: if the user clicks on a document, pull the document
  3. From these positive feedback documents, look at the text they contain, which is what the user has possibly read
    • A positive feedback document represents the user’s interests
  4. From these documents, extract keywords and weight them using TF*IDF (term frequency–inverse document frequency)

Input and output

  • Input – the unpacked documents, i.e. what the user has read
    • What we want – a list of keywords k1, k2, k3, …

Steps

  1. TF – unpack the documents, then count the frequency of each word in each document
    • Some terms will only appear in specialised documents – these are more important!
  2. IDF – inverse document frequency – for each term, count how many of the documents contain it
    • This tells us the weight of the term in the document space!
  3. TF*IDF – multiply the two to find the TF*IDF weight
    • Title and heading words are identified and weighted more highly. The terms with the highest TF*IDF are the core terms, and we need smart ways of aggregating these core terms, e.g. based on similarities or on overlap of the language – then there’s a user model!
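The three steps above can be sketched in a few lines of Python – a toy version that tokenises on whitespace and skips the title/heading boosting:

```python
import math
from collections import Counter

def tfidf_profile(documents, top_k=5):
    """Build a keyword-based user profile from positive-feedback
    documents: weight each term with TF*IDF and keep the
    highest-scoring terms across all documents."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # IDF input: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    # TF*IDF per term, aggregated over all documents
    scores = Counter()
    for tokens in tokenised:
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += (count / len(tokens)) * idf

    return [term for term, _ in scores.most_common(top_k)]
```

Note that a term appearing in every document gets IDF = log(1) = 0, so ubiquitous words such as “the” drop out of the profile automatically – exactly the behaviour the IDF step is there for.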

Building graph-based user model

  • Built by collecting explicit positive and negative feedback from users

  • Input: a graph

  • What are the user’s interests in the documents that we pulled? These documents are the POSITIVE examples of user interests
    • We rely on having a reliable enough method to identify that a document is positive, e.g. how long the user stayed on it, whether they shared it, etc.
  • Reminder: entities are the nodes, and we have relationships between the nodes. We want to extract this graph

  • Graph overlay – overlay the entities that the user is interested in – this is the output
    • From the document, you need to look for concepts from the graph – the world knowledge is usually given, so rather than counting terms, here we look for concepts that are part of this graph, which needs a different approach!
  • The approach we use is semantic tagging
    • There are libraries and tools for this
    • Take the world knowledge from the world model, map it, go through the textual documents, and identify which of the annotated tags/concepts are mentioned in each document
  • Once semantic tagging is done, count in each document how often each particular concept has appeared
  • Then you can decide on the overlay. This is the graph-based profile
  • First you need the graph as an input, and the smartness is in how you do the tagging
    • How? Look through the text, cut it into words or phrases (unigrams, bigrams), and map these to the graph. The text may not be exactly what is in the graph, so we do approximate tagging (similarities, synonyms, partial overlays)
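A minimal sketch of semantic tagging plus the graph overlay, assuming a tiny hand-made concept lexicon (concept → surface forms, including synonyms) standing in for the world-knowledge graph; real systems use much richer approximate matching:

```python
from collections import Counter

# Hypothetical world-knowledge fragment: concept -> surface forms/synonyms
CONCEPT_LEXICON = {
    "travel": {"travel", "trip", "journey", "flight"},
    "airport": {"airport", "terminal"},
    "food": {"food", "restaurant", "meal"},
}

def tag_concepts(document):
    """Approximate semantic tagging: map each unigram in the document
    to a graph concept via its known surface forms."""
    tags = Counter()
    for word in document.lower().split():
        for concept, surface_forms in CONCEPT_LEXICON.items():
            if word in surface_forms:
                tags[concept] += 1
    return tags

def graph_overlay(documents):
    """Overlay: aggregate concept counts over all positive documents."""
    overlay = Counter()
    for doc in documents:
        overlay.update(tag_concepts(doc))
    return overlay
```

The overlay is the graph-based profile: instead of counting terms, we count how often each graph concept is mentioned, so “flight” and “trip” both feed the same travel node.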

Building concept-based user model

  • Nodes represent abstract topics considered interesting to the user, rather than specific words or sets of words

First method


  • We take each document and do semantic tagging to get the overlay
  • From then on, you need to come up with an aggregated list of concepts – look for the top concepts. The overlay may cover parts of the graph that are sparse or very large.
    • You might need to do pre-processing on the graph
      • E.g. use common categories or the most frequent concepts to come up with a list, basing the counting on the graph
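A sketch of the aggregation step, assuming a hypothetical concept-to-category mapping standing in for the pre-processed graph:

```python
from collections import Counter

# Hypothetical mapping from fine-grained concepts to broader categories
PARENT_CATEGORY = {
    "heathrow": "airports",
    "gatwick": "airports",
    "sushi": "food",
    "pizza": "food",
}

def top_categories(concept_counts_per_doc, top_k=2):
    """Roll per-document concept counts up to common parent categories
    and keep the most frequent ones as the concept-based user model."""
    totals = Counter()
    for counts in concept_counts_per_doc:
        for concept, n in counts.items():
            # Fall back to the concept itself if it has no parent
            totals[PARENT_CATEGORY.get(concept, concept)] += n
    return [category for category, _ in totals.most_common(top_k)]
```

Rolling specific concepts up to shared categories is what turns a sparse overlay into a compact list of abstract topics.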

Second method


  • Identify the positive documents, then based on those you need to identify what the common things in these documents are

    • You can cluster the documents – group the most similar documents together – then extract topics for each of the clusters and come up with the top concepts as your user model
    • The user modelling component is the red part of the figure
    • But the input, the positive examples, needs to be reliable!
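The clustering route can be sketched with a greedy word-overlap grouping (Jaccard similarity; the threshold is an illustrative assumption) followed by frequency-based topic extraction:

```python
from collections import Counter

def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    return len(a & b) / len(a | b)

def cluster_documents(documents, threshold=0.2):
    """Greedy clustering: add a document to the first cluster whose
    seed document is similar enough, otherwise start a new cluster."""
    token_sets = [set(doc.lower().split()) for doc in documents]
    clusters = []  # each cluster is a list of document indices
    for i, tokens in enumerate(token_sets):
        for cluster in clusters:
            if jaccard(tokens, token_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def cluster_topics(documents, clusters, top_k=1):
    """Take the most frequent words in each cluster as its topics."""
    topics = []
    for cluster in clusters:
        counts = Counter(word for i in cluster
                         for word in documents[i].lower().split())
        topics.append([word for word, _ in counts.most_common(top_k)])
    return topics
```

A production system would use a proper clustering algorithm (e.g. k-means over TF*IDF vectors), but the shape is the same: group similar positive documents, then label each group with its dominant concepts.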

Summary

  • User Information Collection
    • Explicit: given by the user
    • Implicit: monitoring what the user is doing, collected by the system
  • If we do implicit information collection:
    • Step 1: Identify the user
      • Depends on the data collection
    • Step 2: Construct the model
      • Keyword-based
      • Graph-based
      • Concept-based
      • We need to think about how the model is represented and what the input data is.
